hard negative sample
A Large-scale Dataset for Robust Complex Anime Scene Text Detection
Dong, Ziyi, Zhang, Yurui, Li, Changmao, Golding, Naomi Rue, Long, Qing
Current text detection datasets primarily target natural or document scenes, where text typically appear in regular font and shapes, monotonous colors, and orderly layouts. The text usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scene also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. %Cross-dataset evaluations using state-of-the-art methods demonstrate that models trained on AnimeText achieve superior performance in anime text detection tasks compared to existing datasets. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: https://huggingface.co/datasets/deepghs/AnimeText
Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to specialised visual domains, such as diagrams, which encode structured, symbolic information distinct from that of natural imagery. In this paper, we introduce a novel training paradigm explicitly designed to enhance the comprehension of diagrammatic images within vision-language models. Our approach uses ``hard'' samples for our proposed contrastive learning that incorporates two specialised loss functions that leverage the inherent structural properties of diagrams. By integrating these objectives into model training, our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content. We empirically validate our approach on a benchmark dataset of flowcharts, as a representative class of diagrammatic imagery, demonstrating substantial improvements over standard CLIP and conventional hard negative CLIP learning paradigms for both image-text matching and visual question answering tasks. Our findings underscore the significance of tailored training strategies for specialised tasks and contribute to advancing diagrammatic understanding within the broader landscape of vision-language integration.
Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings
Pan, Tengyu, Duan, Zhichao, Li, Zhenyu, Dong, Bowen, Liu, Ning, Li, Xiuxing, Wang, Jianyong
Text embedding models are essential for various natural language processing tasks, enabling the effective encoding of semantic information into dense vector representations. These models are typically optimized using triplets of (query, positive, negative) data pairs for contrastive learning, where the negative samples play a critical role in enhancing the model's ability to discern subtle semantic distinctions. In this work, we introduce a Multi-Granularity Hard-negative (MGH) synthesis framework that leverages large language models (LLMs) to generate diverse negative samples with varying levels of similarity with the query. This approach facilitates a coarse-to-fine curriculum learning strategy during supervised training, allowing the embedding model to progressively learn more nuanced semantic representations. Meanwhile, we propose an Anchor Token Aware (ATA) pooling method that assigns higher weights to anchor tokens based on aggregation patterns observed in LLMs, improving text embedding accuracy without increasing model complexity. Comprehensive experiments on the MTEB benchmark demonstrate that our methods achieve state-of-the-art performance, surpassing existing synthesis strategies both with synthetic data and when combined with public retrieval datasets.
Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Huang, Xin, Li, Ruibin, Jia, Tong, Zheng, Wei, Wang, Ya
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
Fuzzy Cluster-Aware Contrastive Clustering for Time Series
Wang, Congyu, Du, Mingjing, Jiang, Xiang, Dong, Yongquan
The rapid growth of unlabeled time series data, driven by the Internet of Things (IoT), poses significant challenges in uncovering underlying patterns. Traditional unsupervised clustering methods often fail to capture the complex nature of time series data. Recent deep learning-based clustering approaches, while effective, struggle with insufficient representation learning and the integration of clustering objectives. To address these issues, we propose a fuzzy cluster-aware contrastive clustering framework (FCACC) that jointly optimizes representation learning and clustering. Our approach introduces a novel three-view data augmentation strategy to enhance feature extraction by leveraging various characteristics of time series data. Additionally, we propose a cluster-aware hard negative sample generation mechanism that dynamically constructs high-quality negative samples using clustering structure information, thereby improving the model's discriminative ability. By leveraging fuzzy clustering, FCACC dynamically generates cluster structures to guide the contrastive learning process, resulting in more accurate clustering. Extensive experiments on 40 benchmark datasets show that FCACC outperforms the selected baseline methods (eight in total), providing an effective solution for unsupervised time series learning.
Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization
An, Ruichuan, Zeng, Kai, Lu, Ming, Yang, Sihan, Zhang, Renrui, Ji, Huitong, Zhang, Qizhe, Luo, Yulin, Liang, Hao, Zhang, Wentao
Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at \href{https://github.com/zengkaiya/CaT}{https://github.com/zengkaiya/CaT}.
Review for NeurIPS paper: The Dilemma of TriHard Loss and an Element-Weighted TriHard Loss for Person Re-Identification
Summary and Contributions: This paper aims to alleviate the dilemma of triplet loss when facing hard negative samples. The "dilemma" mentioned in this paper means that the similarity between the anchor and positive sample is useful for better representation, but the similarity between the anchor and negative sample should be repelled, so processing the similarity of anchor, positive sample and hard negative samples is a dilemma problem in triplet loss. To solve this problem, an Element-weighted TriHard Loss function is designed in this paper. The main idea is to weight the feature vectors of anchor and negative sample before calculating their distance to find discriminative elements of their feature vectors. Meanwhile, three mitigation strategies of TriHard loss dilemma are discussed in this paper.
Molecular Graph Contrastive Learning with Line Graph
Chen, Xueyuan, Li, Shangzhe, Liu, Ruomei, Shi, Bowen, Liu, Jiaheng, Wu, Junran, Xu, Ke
Trapped by the label scarcity in molecular property prediction and drug design, graph contrastive learning (GCL) came forward. Leading contrastive learning works show two kinds of view generators, that is, random or learnable data corruption and domain knowledge incorporation. While effective, the two ways also lead to molecular semantics altering and limited generalization capability, respectively. To this end, we relate the \textbf{L}in\textbf{E} graph with \textbf{MO}lecular graph co\textbf{N}trastive learning and propose a novel method termed \textit{LEMON}. Specifically, by contrasting the given graph with the corresponding line graph, the graph encoder can freely encode the molecular semantics without omission. Furthermore, we present a new patch with edge attribute fusion and two local contrastive losses enhance information transmission and tackle hard negative samples. Compared with state-of-the-art (SOTA) methods for view generation, superior performance on molecular property prediction suggests the effectiveness of our proposed framework.