short text
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Meta-Complementing the Semantics of Short Texts in Neural Topic Models
Topic models infer latent topic distributions from observed word co-occurrences in a text corpus. While a corpus typically contains documents of variable lengths, most previous topic models treat documents of different lengths uniformly, assuming that each document is sufficiently informative. However, shorter documents may contain only a few word co-occurrences, resulting in inferior topic quality. Other previous works assume that all documents are short and leverage external auxiliary data, e.g., pretrained word embeddings and document connectivity. Orthogonal to existing works, we remedy this problem within the corpus itself by proposing a Meta-Complement Topic Model, which improves the topic quality of short texts by transferring semantic knowledge learned on long documents to complement semantically limited short texts. As a self-contained module, our framework is agnostic to auxiliary data and can be further improved by flexibly integrating such data. Specifically, when incorporating document connectivity, we further extend our framework to complement documents with few edges. Experiments demonstrate the advantage of our framework.
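The abstract describes the long-to-short transfer only at a high level. As a rough illustration, the sketch below shows one way a short text's encoding could be complemented by attending over a memory of long-document representations inside a VAE-style neural topic model encoder; all class and parameter names are hypothetical, not the authors' code.

```python
# Minimal, illustrative sketch (not the paper's implementation): a neural
# topic model encoder whose short-text input is complemented by attention
# over a memory of long-document representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplementedEncoder(nn.Module):
    def __init__(self, vocab_size: int, hidden: int, n_topics: int):
        super().__init__()
        self.embed = nn.Linear(vocab_size, hidden)   # bag-of-words -> hidden
        self.query = nn.Linear(hidden, hidden)       # attention query over memory
        self.mu = nn.Linear(hidden, n_topics)        # topic-posterior mean
        self.logvar = nn.Linear(hidden, n_topics)    # topic-posterior log-variance

    def forward(self, bow: torch.Tensor, long_memory: torch.Tensor):
        # bow: (batch, vocab); long_memory: (n_long_docs, hidden)
        h = F.relu(self.embed(bow))
        # Attend over long-document semantics to complement the short text.
        attn = torch.softmax(self.query(h) @ long_memory.T, dim=-1)
        h = h + attn @ long_memory                   # complemented representation
        return self.mu(h), self.logvar(h)            # reparameterize as usual
```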
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
- Asia > Singapore (0.05)
- Asia > Middle East > Jordan (0.04)
Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts
Neumann, Julius, Lange, Robert, Susanti, Yuni, Färber, Michael
Sentiment classification in short text datasets faces significant challenges such as class imbalance, limited training samples, and the inherent subjectivity of sentiment labels -- issues that are further intensified by the limited context in short texts. These factors make it difficult to resolve ambiguity and exacerbate data sparsity, hindering effective learning. In this paper, we evaluate the effectiveness of small Transformer-based models (i.e., BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label sentiment classification, with a particular focus on short-text settings. Specifically, we evaluate three key factors influencing model performance: (1) continued domain-specific pre-training, (2) generative data augmentation with automatically generated examples, and (3) architectural variations of the classification head. Our experimental results show that data augmentation improves classification performance, while continued pre-training on augmented datasets can introduce noise rather than boost accuracy. Furthermore, we confirm that modifications to the classification head yield only marginal benefits. These findings provide practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for sentiment classification in short-text datasets.
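For readers reproducing the basic setup, a minimal sketch of multi-label sentiment classification with a small Transformer follows, using the standard Hugging Face `transformers` API; the backbone, label count, and decision threshold are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a multi-label sentiment classifier with a small
# Transformer backbone (illustrative defaults, not the paper's setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=6,                                  # placeholder label count
    problem_type="multi_label_classification",     # BCE-with-logits loss
)

texts = ["Great service, terrible wait time."]     # toy short text
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
probs = torch.sigmoid(logits)                      # one probability per label
predictions = (probs > 0.5).int()                  # independent per-label decisions
```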
- North America > United States > Maryland > Baltimore (0.04)
- Europe > Germany > Saxony > Dresden (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
IOCC: Aligning Semantic and Cluster Centers for Few-shot Short Text Clustering
Yin, Jixuan, Yao, Zhihao, Huo, Wenshuai, Yu, Xinmiao, Feng, Xiaocheng, Li, Bo
In clustering tasks, it is essential to structure the feature space into clear, well-separated distributions. However, because short text representations have limited expressiveness, conventional methods struggle to identify cluster centers that truly capture each category's underlying semantics, causing the representations to be optimized in suboptimal directions. To address this issue, we propose IOCC, a novel few-shot contrastive learning method that achieves alignment between the cluster centers and the semantic centers. IOCC consists of two key modules: Interaction-enhanced Optimal Transport (IEOT) and Center-aware Contrastive Learning (CACL). Specifically, IEOT incorporates semantic interactions between individual samples into the conventional optimal transport problem and generates pseudo-labels. Based on these pseudo-labels, we aggregate high-confidence samples to construct pseudo-centers that approximate the semantic centers. Next, CACL optimizes text representations toward their corresponding pseudo-centers. As training progresses, the collaboration between the two modules gradually reduces the gap between cluster centers and semantic centers. As a result, the model learns a high-quality distribution, improving clustering performance. Extensive experiments on eight benchmark datasets show that IOCC outperforms previous methods, achieving up to a 7.34% improvement on the challenging Biomedical dataset and also excelling in clustering stability and efficiency. The code is available at: https://anonymous.4open.science/r/IOCC-C438.
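The abstract names the two modules without implementation detail. The sketch below illustrates the two underlying ingredients in PyTorch: Sinkhorn-style balanced assignments used as pseudo-labels, and a contrastive pull of embeddings toward pseudo-centers. Temperatures, iteration counts, and function names are assumptions, not IOCC's actual code.

```python
# Illustrative sketch of the two ingredients described in the abstract:
# optimal-transport pseudo-labeling and center-aware contrastive learning.
import torch
import torch.nn.functional as F

def sinkhorn_pseudo_labels(logits: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Balanced soft assignments of samples to clusters (Sinkhorn-Knopp)."""
    q = torch.exp(logits / 0.05).T            # (clusters, batch); 0.05 = temperature
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K   # normalize cluster marginals
        q /= q.sum(dim=0, keepdim=True); q /= B   # normalize sample marginals
    return (q * B).T                          # (batch, clusters) soft pseudo-labels

def center_contrastive_loss(z, pseudo, centers, tau: float = 0.1):
    """Pull each embedding toward its pseudo-center, push from the others."""
    sims = F.normalize(z, dim=-1) @ F.normalize(centers, dim=-1).T / tau
    return F.cross_entropy(sims, pseudo.argmax(dim=-1))
```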
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Germany > Saarland > Saarbrücken (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
An Enhanced Model-based Approach for Short Text Clustering
Cheng, Enhao, Zhang, Shoujia, Yin, Jianhua, Song, Xuemeng, Gan, Tian, Nie, Liqiang
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.

The proliferation of the mobile internet has led to an exponential increase in user-generated data on online platforms, including video, text, and image data. Intelligent processing of such data can significantly enhance quality of life across society and generate substantial economic benefits. Short text data are a prevalent and important form of user-generated data, consisting of concise texts such as microblogs and comments.
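Since the paper builds on GSDMM's collapsed Gibbs sampler, a compact sketch of one Gibbs sweep for the Dirichlet Multinomial Mixture model may help. It follows the standard GSDMM conditional (the denominator term that is constant across clusters is dropped) and omits all GSDMM+ refinements such as entropy-based word weights and cluster merging.

```python
# Illustrative sketch of one collapsed Gibbs sweep for the Dirichlet
# Multinomial Mixture (GSDMM); not the authors' released code.
import math, random
from collections import Counter

def gibbs_sweep(docs, z, K, V, m, n_kw, n_k, alpha=0.1, beta=0.1):
    """docs: list of token lists; z: current cluster of each doc (mutated);
    m[k]: docs in cluster k; n_kw[k]: Counter of word counts in cluster k;
    n_k[k]: total words in cluster k; V: vocabulary size."""
    for d, doc in enumerate(docs):
        k_old, counts = z[d], Counter(doc)
        # Remove document d from its current cluster.
        m[k_old] -= 1
        n_k[k_old] -= len(doc)
        for w, c in counts.items():
            n_kw[k_old][w] -= c
        # Score every cluster in log space to avoid underflow.
        log_p = []
        for k in range(K):
            lp = math.log(m[k] + alpha)            # cluster popularity
            for w, c in counts.items():            # word fit, term by term
                for i in range(c):
                    lp += math.log(n_kw[k][w] + beta + i)
            for i in range(len(doc)):
                lp -= math.log(n_k[k] + V * beta + i)
            log_p.append(lp)
        # Sample a new cluster proportionally to exp(log_p).
        mx = max(log_p)
        weights = [math.exp(lp - mx) for lp in log_p]
        k_new = random.choices(range(K), weights=weights)[0]
        # Add document d back under the sampled cluster.
        z[d] = k_new
        m[k_new] += 1
        n_k[k_new] += len(doc)
        for w, c in counts.items():
            n_kw[k_new][w] += c
```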
- Asia > Middle East > Jordan (0.04)
- Asia > China > Shandong Province > Qingdao (0.04)
- Asia > China > Hong Kong (0.04)
- (13 more...)
- Research Report > New Finding (0.66)
- Research Report > Promising Solution (0.48)
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
Chen, Weizhi, Chen, Jingbo, Deng, Yupeng, Chen, Jiansheng, Feng, Yuman, Xi, Zhihao, Liu, Diyou, Li, Kai, Meng, Yu
This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short-text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) by integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs and provides both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) we design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17%, 0.67%, and 0.92% in Text-to-Image R@1, Image-to-Text R@1, and mR on RSITMD, respectively, and 0.04%, 2.93%, and 1.28% on RSICD. This work provides a new benchmark model and data support for remote sensing multimodal learning.

Recent years have seen significant progress in foundation models (FM) within the fields of computer vision (CV) and natural language processing (NLP) [1] [2] [3] [4] [5] [6] [7] [8]. Unlike models designed for specific task objectives, VLFM learns joint representations of massive image-text pairs in upstream tasks and then transfers this knowledge to various downstream tasks, demonstrating exceptional performance. Several outstanding VLFM models have already emerged, such as CLIP [10], BLIP [11] [12], and MaskVLM [13]. Meanwhile, researchers have begun exploring the application potential of VLFM in the remote sensing domain. However, VLFM often faces issues related to the long-tail effect (where a small number of classes dominate while the rest have fewer samples), making direct application to remote sensing tasks challenging [14].
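The dual-text loss weighting mechanism is described only at a high level. Below is a rough PyTorch sketch of one plausible form: a CLIP-style symmetric InfoNCE loss computed separately for long and short captions, combined with a weighting coefficient. The weight value and function names are assumptions, not LRSCLIP's released code.

```python
# Illustrative sketch of a dual-text contrastive objective: one CLIP-style
# loss per caption granularity, combined with an assumed weight.
import torch
import torch.nn.functional as F

def clip_loss(img: torch.Tensor, txt: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE between image and text embeddings (batch-aligned)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.T / tau
    labels = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def dual_text_loss(img, txt_long, txt_short, w_short: float = 0.2):
    # Long texts carry the main alignment signal; short texts are
    # down-weighted so they refine rather than dominate it.
    return clip_loss(img, txt_long) + w_short * clip_loss(img, txt_short)
```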
- Asia > China > Beijing > Beijing (0.85)
- Europe > Germany > Brandenburg > Potsdam (0.04)
- North America > United States (0.04)
- (3 more...)
- Information Technology (1.00)
- Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (1.00)
- Education (1.00)
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Dong, Zican, Li, Junyi, Jiang, Jinhao, Xu, Mingyu, Zhao, Wayne Xin, Wang, Bingning, Chen, Weipeng
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, and the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden states of selected layers from the original model on short texts. Additionally, LongReD introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining a capacity to handle long texts that is comparable to or even better than that of the baselines.
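As a rough illustration of the restoration-distillation idea, the sketch below matches selected hidden states of the context-extended model to those of the frozen original model on short texts. The layer indices, loss weight, and Hugging Face-style model interface are assumptions, and the short-to-long distillation branch is omitted.

```python
# Illustrative sketch of hidden-state restoration distillation on short
# texts; layer choice and weights are assumptions, not LongReD's code.
import torch
import torch.nn.functional as F

def restoration_distill_loss(extended_model, original_model, short_batch,
                             layers=(8, 16, 24), w_distill: float = 1.0):
    # Reference hidden states from the frozen, non-extended model.
    with torch.no_grad():
        ref = original_model(**short_batch, output_hidden_states=True)
    out = extended_model(**short_batch, output_hidden_states=True)
    loss = 0.0
    for l in layers:
        # Pull the extended model's short-text hidden states back toward
        # the original model's distribution (counters drift and forgetting).
        loss = loss + F.mse_loss(out.hidden_states[l], ref.hidden_states[l])
    return w_distill * loss / len(layers)
```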
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- (14 more...)
- Research Report > New Finding (0.93)
- Research Report > Promising Solution (0.75)