Woodpecker


Woodpeckers grunt like tennis players

Popular Science

'They take the pecking that we see all birds doing and take it to the extreme.' Woodpeckers really know how to punch above their weight. The woodland birds can attack a tree at about 15 miles per hour with their powerful beaks. To achieve this, woodpeckers essentially turn themselves into hammers, bracing their head, neck, abdomen, and tail muscles to hold their bodies completely rigid as they pound into wood.


CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base

Nguyen, Cong-Duy, Wu, Xiaobao, Vu, Duc Anh, Zhao, Shuai, Nguyen, Thong, Luu, Anh Tuan

arXiv.org Artificial Intelligence

Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination, where non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. To address these limitations, we propose CutPaste&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs. Our approach leverages off-the-shelf visual and linguistic modules to perform multi-step verification efficiently without requiring LVLM inference. At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations. We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs. Comprehensive evaluations on benchmark datasets, including POPE and R-Bench, demonstrate that CutPaste&Find achieves competitive hallucination detection performance while being significantly more efficient and cost-effective than previous methods.
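The scaling factor mentioned in the abstract addresses a familiar quirk of contrastive embeddings: cosine similarity between matching image-text pairs often saturates well below 1. The sketch below is purely illustrative; the function name, the value of `alpha`, and the clipping behavior are assumptions, not the paper's actual formulation.

```python
import numpy as np

def scaled_similarity(img_emb, txt_emb, alpha=2.5):
    # Cosine similarity between CLIP-style embeddings tends to land
    # well below 1.0 even for ground-truth pairs; a scaling factor
    # (alpha, an assumed value here) stretches scores toward a more
    # usable [0, 1] range before thresholding for hallucination checks.
    a = np.asarray(img_emb, dtype=float)
    b = np.asarray(txt_emb, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return min(1.0, alpha * cos)
```

With a detection threshold applied to the rescaled score, a caption entity whose cut-out patch fails to align with any region of the image can be flagged as hallucinated.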


A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Chang, Yue, Jing, Liqiang, Zhang, Xiaopeng, Zhang, Yue

arXiv.org Artificial Intelligence

Hallucination is a common problem for Large Vision-Language Models (LVLMs), especially in long generations, and it is difficult to eradicate. A generation with hallucinations is partially inconsistent with the image content. To mitigate hallucination, current studies focus either on the model's inference process or on its generated results, but the solutions they design sometimes fail to deal appropriately with the variety of query types and the hallucinations those queries elicit. To handle these varied hallucinations accurately, we present a unified framework, Dentist, for hallucination mitigation. The core step is to first classify the queries and then perform different hallucination-mitigation processes based on the classification result, just as a dentist first examines the teeth and then makes a plan. In a simple deployment, Dentist can classify queries as perception or reasoning and easily mitigate potential hallucinations in answers, as demonstrated in our experiments.


Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models

Wang, Kohou, Liu, Xiang, Liu, Zhaoxiang, Wang, Kai, Lian, Shiguo

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) have made significant progress in bridging the gap between visual and language modalities. However, hallucinations in MLLMs, where the generated text does not align with image content, continue to be a major challenge. Existing methods for addressing hallucinations often rely on instruction tuning, which requires retraining the model with specific data and further increases the cost of using MLLMs. In this paper, we introduce a novel training-free method, named Piculet, for enhancing the input representation of MLLMs. Piculet leverages multiple specialized models to extract descriptions of visual information from the input image and combines these descriptions with the original image and query as input to the MLLM. We evaluate our method both quantitatively and qualitatively, and the results demonstrate that Piculet greatly decreases hallucinations of MLLMs. Our method is universal and can be easily extended to different MLLMs.
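Piculet's core idea, as described above, is to prepend descriptions from specialized models (detectors, OCR, etc.) to the query before it reaches the MLLM. A minimal sketch of that prompt assembly, with an entirely hypothetical formatting convention:

```python
def build_piculet_prompt(query, descriptions):
    # Assemble the augmented MLLM input: specialized-model outputs
    # first, then the user's question. The exact layout here is an
    # assumption for illustration, not the paper's actual template.
    facts = "\n".join(f"- {d}" for d in descriptions)
    return (
        "Visual facts extracted by specialized models:\n"
        f"{facts}\n\n"
        f"Question: {query}"
    )
```

The augmented prompt, together with the original image, is then passed to the MLLM unchanged, which is what makes the approach training-free and model-agnostic.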


Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance

Zhao, Linxi, Deng, Yihe, Zhang, Weitong, Gu, Quanquan

arXiv.org Artificial Intelligence

The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existent objects in images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs to correct the model's output post-generation. In this paper, we tackle this challenge by introducing a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), which is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process. Specifically, MARINE enriches the visual context of LVLMs by integrating existing open-source vision models, and employs classifier-free guidance to incorporate the additional object-grounding features to improve the precision of LVLMs' generations. Through comprehensive evaluations across 6 popular LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it not only reduces hallucinations but also improves the detailedness of LVLMs' generations, as assessed by GPT-4V.
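Classifier-free guidance, which MARINE adapts to object grounding, combines two logit streams at each decoding step: one conditioned on the extra grounding features and one without them. A minimal sketch of the standard combination rule, where the guidance strength (called `gamma` here) is an assumed hyperparameter name:

```python
import numpy as np

def classifier_free_guidance(logits_cond, logits_uncond, gamma=1.0):
    # Standard classifier-free guidance combination:
    #   l = (1 + gamma) * l_cond - gamma * l_uncond
    # Pushing logits away from the unconditional stream biases decoding
    # toward tokens supported by the extra conditioning (here, the
    # object-grounding features supplied by open-source vision models).
    return (1.0 + gamma) * np.asarray(logits_cond) - gamma * np.asarray(logits_uncond)
```

With `gamma = 0` this reduces to ordinary conditional decoding; larger values weight the grounding signal more heavily, at the usual risk of over-steering the generation.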


Woodpecker: Hallucination Correction for Multimodal Large Language Models

Yin, Shukang, Fu, Chaoyou, Zhao, Sirui, Xu, Tong, Wang, Hao, Sui, Dianbo, Shen, Yunhang, Li, Ke, Sun, Xing, Chen, Enhong

arXiv.org Artificial Intelligence

Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we take a different path, introducing a training-free method named Woodpecker. Just as a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while remaining interpretable through the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and demonstrate the great potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
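The five stages above can be illustrated with a toy, self-contained sketch. Everything here is an assumption for illustration: the real system uses an open-vocabulary detector and a VQA model for validation, and an LLM for the rewrite, none of which appear in this simplified string-based version.

```python
def extract_key_concepts(text):
    # Stage 1: pull candidate object mentions from the generated text
    # (toy version: match against a tiny fixed vocabulary).
    vocab = {"dog", "cat", "frisbee", "bicycle"}
    return [w.strip(".,").lower() for w in text.split()
            if w.strip(".,").lower() in vocab]

def formulate_questions(concepts):
    # Stage 2: turn each concept into a verification question.
    return {c: f"Is there a {c} in the image?" for c in concepts}

def validate_visually(question, detections):
    # Stage 3: answer each question against detector output
    # (a real system would run VQA + open-vocabulary detection here).
    concept = question.split()[3]  # "Is there a <concept> in the image?"
    return concept in detections

def correct(text, concepts, detections):
    # Stages 4-5: build visual claims from the evidence and rewrite
    # mentions that the image does not support.
    questions = formulate_questions(concepts)
    verified = {c: validate_visually(q, detections)
                for c, q in questions.items()}
    for c, ok in verified.items():
        if not ok:
            text = text.replace(c, f"[no {c} visible]")
    return text
```

For example, if the detector only finds a dog, the caption "A dog chases a frisbee." would have its unsupported frisbee mention flagged and rewritten, which mirrors the post-remedy, model-agnostic character of the pipeline.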


The Software Industry Is Still the Problem

Communications of the ACM

Around the time computers were old enough to drink, software engineering guru Gerald Weinberg said: "If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization." This is not a plotline science fiction authors have ever neglected. Actually, some titles are still worth a trip to the library: for example, Poul Anderson's Sam Hall from 1953, which shows how too much reliance on "infallible" computer surveillance can turn into an autoimmune collapse for a nation-state, or, for that matter, any large organization. At the more obscure end of the spectrum, there is Swedish Nobel Laureate Hannes Alfvén, publishing in Swedish under the pseudonym Oluf Johannesson, with Sagan om den stora Datamaskinen [Tale of the Big Computer] from 1966. As with almost all science fiction pieces, however, they miss the future by a wide margin.


Machine vision that sees things more the way we do is easier for us to understand

#artificialintelligence

A new image recognition algorithm uses the way humans see things for inspiration. The context: When humans look at a new image of something, we identify what it is based on a collection of recognizable features. We might identify the species of a bird, for example, by the contour of its beak, the colors of its plume, and the shape of its feet. A neural network, however, simply looks for pixel patterns across the entire image without discriminating between the actual bird and its background. This makes the neural network more vulnerable to mistakes and makes it harder for humans to diagnose them.


I used facial recognition technology on birds

#artificialintelligence

As a birder, I had heard that if you paid careful attention to the head feathers on the downy woodpeckers that visited your bird feeders, you could begin to recognize individual birds. I even went so far as to try sketching birds at my own feeders and had found this to be true, up to a point. In the meantime, in my day job as a computer scientist, I knew that other researchers had used machine learning techniques to recognize individual faces in digital images with a high degree of accuracy. These projects got me thinking about ways to combine my hobby with my day job. Would it be possible to apply those techniques to identify individual birds?