AITopics | Bi, Keping

Collaborating Authors

Bi, Keping

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification

Zhang, Mingkun, Bi, Keping, Chen, Wei, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceMar-2-2025

A BSTRACT In this paper, we aim to build an adversarially robust zero-shot image classifier. We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts "a photo of a < class-name> .". Purification is the path we choose since it does not require adversarial training on specific attack types and thus can cope with any foreseen attacks. We then formulate purification risk as the KL divergence between the joint distributions of the purification process of denoising the adversarial samples and the attack process of adding perturbations to benign samples, through bidirectional Stochastic Differential Equations (SDEs). The final derived results inspire us to explore purification in the multi-modal latent space of CLIP . We propose two variants for our CLIPure approach: CLIPure-Diff which models the likelihood of images' latent vectors with the DiffusionPrior module in DaLLE-2 (modeling the generation process of CLIP's latent vectors), and CLIPure-Cos which models the likelihood with the cosine similarity between the embeddings of an image and "a photo of a.". As far as we know, CLIPure is the first purification method in multi-modal latent space and CLIPure-Cos is the first purification method that is not based on generative models, which substantially improves defense efficiency. We conducted extensive experiments on CIFAR-10, ImageNet, and 13 datasets that previous CLIP-based defense methods used for evaluating zero-shot classification robustness. Among them, CLIP (Radford et al., 2021) is an example that is popular, effective, and efficient. CLIP performs zero-shot classification by forming text prompts "a photo of a < class-name> ." of all the candidate categories, and selecting the class with the highest similarity with the image embedding. Despite its efficacy, when facing adversarial attacks, its accuracy can drop to zero, similarly vulnerable to other neural classifiers. Existing methods to enhance adversarial robustness follow two primary paths: adversarial training and purification. Adversarial Training (A T) (Madry et al., 2017; Rebuffi et al., 2021; Wang et al., 2023) incorporates adversarial examples into model training to boost robustness. It often achieves corresponding authors 1 arXiv:2502.18176v2 FARE (Schlarmann et al., 2024) and TeCoA (Mao et al., 2022) are two A T approaches integrated with CLIP, which enhance CLIP's zero-shot classification robustness while harming clean accuracy significantly and do not generalize to other types of attacks.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.18176

Country:

Asia > China (0.14)
Europe > Germany (0.14)
Europe > Spain (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Energy (0.46)
Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Estimating Commonsense Plausibility through Semantic Shifts

Cui, Wanqing, Bi, Keping, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceFeb-19-2025

Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches--reliant on likelihoods or verbalized judgments--struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. It demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs yield superior performance to LMs, when integrated with ComPaSS, on vision-grounded commonsense tasks. (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.13464

Country:

Asia > China (0.14)
Asia > Thailand (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception

Ni, Shiyu, Bi, Keping, Guo, Jiafeng, Yu, Lulu, Bi, Baolong, Cheng, Xueqi

arXiv.org Artificial IntelligenceFeb-17-2025

Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries, leading to confident yet incorrect responses. This paper explores leveraging LLMs' internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives. We investigate whether LLMs can estimate their confidence using internal states before response generation, potentially saving computational resources. Our experiments on datasets like Natural Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant pre-generation perception, which is further refined post-generation, with perception gaps remaining stable across varying conditions. To mitigate risks in critical domains, we introduce Consistency-based Confidence Calibration ($C^3$), which assesses confidence consistency through question reformulation. $C^3$ significantly improves LLMs' ability to recognize their knowledge gaps, enhancing the unknown perception rate by 5.6\% on NQ and 4.9\% on HotpotQA. Our findings suggest that pre-generation confidence estimation can optimize efficiency, while $C^3$ effectively controls output risks, advancing the reliability of LLMs in practical applications.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.11677

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs

Yang, Sihui, Bi, Keping, Cui, Wanqing, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceSep-30-2024

Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. The commonly used automatic evaluation metrics like ROUGE or BERTScore cannot accurately measure semantic similarities or answers from different perspectives. Recently, Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparisons between answers. Inspired by the evolution from pointwise to pairwise to listwise in learning-to-rank methods, we propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality. Moreover, for NF questions that do not have multi-grade or any golden answers, we leverage LLMs to generate the reference answer list of various quality to facilitate the listwise evaluation. Extensive experimental results on three NFQA datasets, i.e., ANTIQUE, the TREC-DL-NF, and WebGLM show that our method has significantly higher correlations with human annotations compared to automatic scores and common pointwise and pairwise approaches.

candidate answer, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2409.14744

Country:

Asia > China (0.14)
North America > United States (0.14)
Europe > Spain (0.14)
Europe > Portugal (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy

Zhang, Hengran, Bi, Keping, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceJun-17-2024

Utility and topical relevance are critical measures in information retrieval (IR), reflecting system and user perspectives, respectively. While topical relevance has long been emphasized, utility is a higher standard of relevance and is more useful for facilitating downstream tasks, e.g., in Retrieval-Augmented Generation (RAG). When we incorporate utility judgments into RAG, we realize that the topical relevance, utility, and answering in RAG are closely related to the three types of relevance that Schutz discussed from a philosophical perspective. They are topical relevance, interpretational relevance, and motivational relevance, respectively. Inspired by the dynamic iterations of the three types of relevance, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step of the cycle of RAG. We conducted extensive experiments on multi-grade passage retrieval and factoid question-answering datasets (i.e., TREC DL, WebAP, and NQ). Experimental results demonstrate significant improvements in utility judgments, ranking of topical relevance, and answer generation upon representative baselines, including multiple single-shot utility judging approaches. Our code and benchmark can be found at https://anonymous.4open.science/r/ITEM-B486/.

large language model, machine learning, utility judgment, (21 more...)

arXiv.org Artificial Intelligence

2406.1129

Country:

North America > United States (1.00)
Europe (0.92)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Television (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning

Cui, Wanqing, Bi, Keping, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceJun-13-2024

Since commonsense information has been recorded significantly less frequently than its existence, language models pre-trained by text generation have difficulty to learn sufficient commonsense knowledge. Several studies have leveraged text retrieval to augment the models' commonsense ability. Unlike text, images capture commonsense information inherently but little effort has been paid to effectively utilize them. In this work, we propose a novel Multi-mOdal REtrieval (MORE) augmentation framework, to leverage both text and images to enhance the commonsense ability of language models. Extensive experiments on the Common-Gen task have demonstrated the efficacy of MORE based on the pre-trained models of both single and multiple modalities.

information, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2402.13625

Country:

Asia > China (0.14)
North America > United States (0.14)
North America > Canada (0.14)

Genre: Research Report (1.00)

Industry: Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.73)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

Add feedback

When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation

Ni, Shiyu, Bi, Keping, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceJun-11-2024

Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge and tend to provide specious answers in such cases. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations. However, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct RA all the time. A straightforward idea is to only conduct retrieval when LLMs are uncertain about a question. This motivates us to enhance the LLMs' ability to perceive their knowledge boundaries to help RA. In this paper, we first quantitatively measure LLMs' such ability and confirm their overconfidence. Then, we study how LLMs' certainty about a question correlates with their dependence on external retrieved information. We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence. Additionally, equipped with these methods, LLMs can achieve comparable or even better performance of RA with much fewer retrieval calls.

artificial intelligence, large language model, natural language, (12 more...)

arXiv.org Artificial Intelligence

2402.11457

Country:

Asia (0.28)
North America > Canada (0.14)

Genre: Research Report (0.40)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback