donut
The Confidence Paradox: Can LLM Know When It's Wrong
Tripathi, Sahil, Nafis, Md Tabrez, Hussain, Imran, Gao, Jiechao
Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses, especially under uncertainty. Existing models like LayoutLMv3, UDOP, and DONUT focus on accuracy but lack ethical calibration. We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning. We introduce two new metrics, Honesty Score (H-Score) and Ethical Confidence Index (ECI), to evaluate ethical alignment. HonestVQA improves accuracy and F1 by up to 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets, while reducing overconfidence. It also generalizes well across domains, achieving 78.9% accuracy and 76.1% F1-score.
DONUT: Physics-aware Machine Learning for Real-time X-ray Nanodiffraction Analysis
Luo, Aileen, Zhou, Tao, Du, Ming, Holt, Martin V., Singer, Andrej, Cherukara, Mathew J.
Coherent X-ray scattering techniques are critical for investigating the fundamental structural properties of materials at the nanoscale. While advancements have made these experiments more accessible, real-time analysis remains a significant bottleneck, often hindered by artifacts and computational demands. In scanning X-ray nanodiffraction microscopy, which is widely used to spatially resolve structural heterogeneities, this challenge is compounded by the convolution of the divergent beam with the sample's local structure. To address this, we introduce DONUT (Diffraction with Optics for Nanobeam by Unsupervised Training), a physics-aware neural network designed for the rapid and automated analysis of nanobeam diffraction data. By incorporating a differentiable geometric diffraction model directly into its architecture, DONUT learns to predict crystal lattice strain and orientation in real-time. Crucially, this is achieved without reliance on labeled datasets or pre-training, overcoming a fundamental limitation for supervised machine learning in X-ray science. We demonstrate experimentally that DONUT accurately extracts all features within the data over 200 times more efficiently than conventional fitting methods.
LLMs can implicitly learn from mistakes in-context
Alazraki, Lisa, Mozes, Maximilian, Campos, Jon Ander, Tan, Yi Chern, Rei, Marek, Bartolo, Max
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are scored equally as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
"What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs
Zmigrod, Ran, Shetty, Pranav, Sibue, Mathieu, Ma, Zhiqiang, Nourbakhsh, Armineh, Liu, Xiaomo, Veloso, Manuela
The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template "What is the value for the {key}?". However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when training on K2Q versus training on simpler templates to motivate the need of our work. We find that creating diverse and intricate KIE questions enhances the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.
TreeForm: End-to-end Annotation and Evaluation for Form Document Parsing
Zmigrod, Ran, Ma, Zhiqiang, Nourbakhsh, Armineh, Shah, Sameena
Visually Rich Form Understanding (VRFU) poses a complex research problem due to the documents' highly structured nature and yet highly variable style and content. Current annotation schemes decompose form understanding and omit key hierarchical structure, making development and evaluation of end-to-end models difficult. In this paper, we propose a novel F1 metric to evaluate form parsers and describe a new content-agnostic, tree-based annotation scheme for VRFU: TreeForm. We provide methods to convert previous annotation schemes into TreeForm structures and evaluate TreeForm predictions using a modified version of the normalized tree-edit distance. We present initial baselines for our end-to-end performance metric and the TreeForm edit distance, averaged over the FUNSD and XFUND datasets, of 61.5 and 26.4 respectively. We hope that TreeForm encourages deeper research in annotating, modeling, and evaluating the complexities of form-like documents.
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Cao, Haoyu, Bao, Changcun, Liu, Chaohu, Chen, Huang, Yin, Kun, Liu, Hao, Liu, Yinsong, Jiang, Deqiang, Sun, Xing
We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, including document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process of the visual tokens of interest, using a content-aware token merge module. This mechanism enables the model to pay more attention to regions of interest generated by the query decoder, improving the model's effectiveness and speeding up the decoding speed of the generative scheme. We also designed several pre-training tasks to enhance the understanding and local awareness of the model. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards enabling efficient and effective end-to-end document understanding.
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Lee, Kenton, Joshi, Mandar, Turc, Iulia, Hu, Hexiang, Liu, Fangyu, Eisenschlos, Julian, Khandelwal, Urvashi, Shaw, Peter, Chang, Ming-Wei, Toutanova, Kristina
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
Supermassive black hole: First EVER full resolution photo is revealed
It is a thing of mesmerising beauty: humanity's first glimpse at the only full resolution photo of a supermassive black hole ever produced. This 'orange donut', as it has been dubbed, sits at the heart of the Messier 87 galaxy 55 million light-years from Earth and in 2019 became the first black hole to be directly imaged by astronomers. Now, with the help of artificial intelligence (AI) machine learning, it has received its first official makeover -- and the results reveal that rather than being a 'fuzzy donut', it is actually more of a 'skinny donut'. Scientists say this new perspective of the supermassive black hole will 'play a critical role in our ability to understand its behaviour' and could help explain how the stellar phenomenon 'eats' matter. They called it a 'golden opportunity' to learn more about black hole physics.
Solve a mystery box like a data scientist
What happens when a data scientist gets a riddle in the form of a box? Of course he will (try to) approach it as a data problem. In this article I will describe the whole process, and to be honest, it was not as easy as I thought. As with many problems, you can get completely lost, and only by talking to a couple of friends did I get back on track. As a data scientist, I like to approach this problem in a data-driven manner. I realize that this method is far from the most obvious solution, but it was a very fun endeavor: collecting too much data, training a transformer model to extract values from a video, and eventually using a minimizer to find the solution. This article is a summary of this (mostly) fun journey! I have divided this article into a couple of (for me) logical steps. All images in this article have been taken or generated by me unless stated otherwise in the separate captions (which is none in this article).
What and Why Tidy Data?
Data scientists like to work with tidy data because its consistent structure makes visualization, data manipulation, and modeling much easier. Common coding environments for data science, including RStudio, pandas in Python, and related packages, have been designed to work well with tidy data. The first critical step in investigating a dataset is tidying. We will take a look at each rule from R for Data Science and see how you can format a data frame for each donut that you, as a data scientist/baker, can use to visualize, explore, or model your data.
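As a rough illustration of what tidying means in practice, the sketch below reshapes a small "wide" donut table into tidy form, where each variable has its own column, each observation its own row, and each value its own cell. The donut sales figures, the column names, and the `to_tidy` helper are all hypothetical and are not taken from the article; the code uses only the Python standard library so that it runs anywhere.

```python
# Hypothetical "wide" donut data: one row per flavor, one column per day.
# The flavors, days, and counts are made up for illustration.
wide_rows = [
    {"flavor": "glazed", "mon": 12, "tue": 15},
    {"flavor": "jelly", "mon": 7, "tue": 9},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows into long rows with one observation per row.

    Every key other than `id_key` is treated as a measured variable:
    its name goes into `var_name` and its value into `value_name`.
    """
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every output row
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


tidy_rows = to_tidy(wide_rows, id_key="flavor", var_name="day", value_name="sold")
for record in tidy_rows:
    print(record)
```

In the tidy result, "day" becomes a proper variable rather than being spread across column headers. With pandas, the same reshaping is usually done with `DataFrame.melt`, passing `id_vars`, `var_name`, and `value_name`.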