AITopics | outlier sample

Collaborating Authors

outlier sample

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

Seo, Yeongbin, Kim, Gayoung, Kim, Jaehyung, Yeo, Jinyoung

arXiv.org Artificial IntelligenceSep-30-2025

As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.18577

Genre: Research Report > New Finding (0.67)

Industry:

Education (0.47)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples

Mirzaei, Hossein, Jafari, Mohammad, Dehbashi, Hamid Reza, Ansari, Ali, Ghobadi, Sepehr, Hadi, Masoud, Moakhar, Arshia Soltani, Azizmalayeri, Mohammad, Baghshah, Mahdieh Soleymani, Rohban, Mohammad Hossein

arXiv.org Artificial IntelligenceJan-28-2025

In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates ``diverse'' and ``near-distribution'' outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.

data mining, detection, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.16971

Country:

Europe > Austria > Vienna (0.14)
North America > United States (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Reviews: Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network

Neural Information Processing SystemsJan-24-2025, 12:43:55 GMT

I think that even inliers do not have a unified learning target in AE-based methods is the key reason why AE-based methods fails. It will be nice to empirically verify this. For example, the authors can do a similar experiment as that in Figure 2. Anyway, I believe this paper has its contribution to the community. And surprisingly, the algorithms greatly improve the outlier detection performance compared to multiple AE-based methods. Questions: - The key point of this method is to create pseudo labels for the unlabeled data, which will augment the training data by default.

ae-based method, dimension, opération, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.62)

Add feedback

TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification

Zhu, Zheng-An, Chien, Hsin-Che, Chiang, Chen-Kuo

arXiv.org Artificial IntelligenceJan-15-2025

This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address the patch noises and feature inconsistency in unsupervised person re-identification works. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory bank based contrastive methods may lead data inconsistency due to the limitation of batch size. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. It sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noises to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at \href{https://github.com/andy412510/TCMM}{https://github.com/andy412510/TCMM}.

outlier sample, person re-identification, pseudo label, (11 more...)

arXiv.org Artificial Intelligence

2501.09044

Country:

Asia > Taiwan (0.04)
Europe > Switzerland (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Explaining Model Overfitting in CNNs via GMM Clustering

Dou, Hui, Mu, Xinyu, Yi, Mengjun, Han, Feng, Zhao, Jian, Shen, Furao

arXiv.org Artificial IntelligenceDec-12-2024

Convolutional Neural Networks (CNNs) have demonstrated remarkable prowess in the field of computer vision. However, their opaque decision-making processes pose significant challenges for practical applications. In this study, we provide quantitative metrics for assessing CNN filters by clustering the feature maps corresponding to individual filters in the model via Gaussian Mixture Model (GMM). By analyzing the clustering results, we screen out some anomaly filters associated with outlier samples. We further analyze the relationship between the anomaly filters and model overfitting, proposing three hypotheses. This method is universally applicable across diverse CNN architectures without modifications, as evidenced by its successful application to models like AlexNet and LeNet-5. We present three meticulously designed experiments demonstrating our hypotheses from the perspectives of model behavior, dataset characteristics, and filter impacts. Through this work, we offer a novel perspective for evaluating the CNN performance and gain new insights into the operational behavior of model overfitting.

anomaly filter, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2412.10457

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
North America > United States > Nevada > Clark County > Las Vegas (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Improving Data-aware and Parameter-aware Robustness for Continual Learning

Xiao, Hanxi, Lyu, Fan

arXiv.org Artificial IntelligenceMay-27-2024

The goal of Continual Learning (CL) task is to continuously learn multiple new tasks sequentially while achieving a balance between the plasticity and stability of new and old knowledge. This paper analyzes that this insufficiency arises from the ineffective handling of outliers, leading to abnormal gradients and unexpected model updates. To address this issue, we enhance the data-aware and parameter-aware robustness of CL, proposing a Robust Continual Learning (RCL) method. From the data perspective, we develop a contrastive loss based on the concepts of uniformity and alignment, forming a feature distribution that is more applicable to outliers. From the parameter perspective, we present a forward strategy for worst-case perturbation and apply robust gradient projection to the parameters. The experimental results on three benchmarks show that the proposed method effectively maintains robustness and achieves new state-of-the-art (SOTA) results. The code is available at: https://github.com/HanxiXiao/RCL

continual learning, gradient, learning, (14 more...)

arXiv.org Artificial Intelligence

2405.17054

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Outlier-robust Estimation of a Sparse Linear Model Using Invexity

Barik, Adarsh, Honorio, Jean

arXiv.org Artificial IntelligenceJun-22-2023

In this paper, we study problem of estimating a sparse regression vector with correct support in the presence of outlier samples. The inconsistency of lasso-type methods is well known in this scenario. We propose a combinatorial version of outlier-robust lasso which also identifies clean samples. Subsequently, we use these clean samples to make a good estimation. We also provide a novel invex relaxation for the combinatorial problem and provide provable theoretical guarantees for this relaxation. Finally, we conduct experiments to validate our theory and compare our results against standard lasso.

artificial intelligence, machine learning, probability, (18 more...)

arXiv.org Artificial Intelligence

2306.12678

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models

He, Guande, Chen, Jianfei, Zhu, Jun

arXiv.org Artificial IntelligenceMay-30-2023

Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning. However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings. In this paper, we tackle the problem of calibrating finetuned language models. We demonstrate that the PLMs are well-calibrated on the masked language modeling task with robust predictive confidence under domain shift, yet the fine-tuned models fail to retain such property due to catastrophic forgetting, which impacts the calibration on the downstream classification task. In light of these observations, we evaluate the calibration of several methods that preserve pre-trained features and show that preserving pre-trained features can improve the calibration of fine-tuned language models. Among these methods, our proposed method that encourages the fine-tuned model to learn generative representations with auxiliary language modeling objective achieves competitive accuracy and the lowest expected calibration error compared to several strong baselines under both in-domain and out-of-domain settings on three downstream NLU tasks. Fine-tuning pre-trained language models (PLMs) is a dominating paradigm for natural language understanding (NLU) with state-of-the-art results for a variety of NLU tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; He et al., 2021a). The powerful fine-tuned language models have been experimented with for decision-making in real-world applications such as the healthcare domain (He et al., 2020) and safety-critical domain (Sandagiri et al., 2020), where the classification networks need to be highly accurate and provide calibrated confidence for their predictions to improve the safety and trustiness of the models (Guo et al., 2017). For example, suppose a medical language inference LM that predicts the disease given the description of symptoms is well-calibrated, i.e., the model's posterior probabilities (or confidence) align well with the true correctness likelihood. In that case, the wrong predictions can be easier to detect and correct by human doctors by given low predictive confidence.

calibration, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2305.19249

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > India (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.54)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Filters

Collaborating Authors

outlier sample

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

9719a00ed0c5709d80dfef33795dcef3-Paper.pdf

Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

9719a00ed0c5709d80dfef33795dcef3-Paper.pdf

RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples

Reviews: Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network

TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification

Explaining Model Overfitting in CNNs via GMM Clustering

Improving Data-aware and Parameter-aware Robustness for Continual Learning

Outlier-robust Estimation of a Sparse Linear Model Using Invexity

Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models