Trustless Federated Learning at Edge-Scale: A Compositional Architecture for Decentralized, Verifiable, and Incentive-Aligned Coordination

Onobhayedo, Pius, Oamen, Paul Osemudiame

arXiv.org Artificial Intelligence

Artificial intelligence is retracing the Internet's path from centralized provision to distributed creation. Initially, resource-intensive computation concentrates within institutions capable of training and serving large models. Eventually, as federated learning matures, billions of edge devices holding sensitive data will be able to collectively improve models without surrendering raw information, enabling both contribution and consumption at scale. This democratic vision remains unrealized due to several compositional gaps: aggregators handle updates without accountability; economic mechanisms are absent or, even when present, remain vulnerable to gaming; coordination serializes state modifications, limiting scalability; and governance permits retroactive manipulation. This work addresses these gaps by leveraging cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation. The product of this work is a design architecture--not an actual implementation--that seeks to pass the baton in the race toward truly collaborative intelligence: an intelligence of the people, by the people, for the people.
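The mechanisms above are described only at the architecture level. As one illustration, geometric novelty measurement could be read as scoring a client update by its angular distance from the running aggregate, so that replayed or duplicated updates earn no reward. A minimal sketch under that assumption (the function name and scoring rule are hypothetical, not from the paper):

```python
import math

def novelty_score(update, aggregate):
    """Score a client update by its angular distance from the current
    aggregate direction: replayed or duplicated updates score near 0,
    genuinely new (orthogonal) directions score near 1."""
    dot = sum(u * a for u, a in zip(update, aggregate))
    denom = math.hypot(*update) * math.hypot(*aggregate)
    if denom == 0.0:
        return 0.0  # degenerate (zero) update earns nothing
    return 1.0 - abs(dot / denom)

agg = [1.0, 0.0]
print(novelty_score([2.0, 0.0], agg))  # parallel to aggregate: 0.0
print(novelty_score([0.0, 3.0], agg))  # orthogonal direction: 1.0
```

In a full design this score would presumably be combined with the cryptographic receipts, so that the novelty measurement itself is verifiable rather than trusted.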


E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

Jeong, Cheonsu, Sim, Seongmin, Cho, Hyoyoung, Kim, Sungsu, Shin, Byounggwan

arXiv.org Artificial Intelligence

This paper presents an intelligent work automation approach in the context of contemporary digital transformation by integrating generative AI and Intelligent Document Processing (IDP) technologies with an Automation Agent to realize End-to-End (E2E) automation of corporate financial expense processing tasks. While traditional Robotic Process Automation (RPA) has proven effective for repetitive, rule-based simple task automation, it faces limitations in handling unstructured data, exception management, and complex decision-making. This study designs and implements a four-stage integrated process comprising automatic recognition of supporting documents such as receipts via OCR/IDP, item classification based on a policy-driven database, intelligent exception handling supported by generative AI (large language models, LLMs), and human-in-the-loop final decision-making with continuous system learning through an Automation Agent. Applied to a major Korean enterprise (Company S), the system demonstrated quantitative benefits including over 80% reduction in processing time for paper receipt expense tasks, decreased error rates, and improved compliance, as well as qualitative benefits such as enhanced accuracy and consistency, increased employee satisfaction, and data-driven decision support. Furthermore, the system embodies a virtuous cycle by learning from human judgments to progressively improve automatic exception handling capabilities. Empirically, this research confirms that the organic integration of generative AI, IDP, and Automation Agents effectively overcomes the limitations of conventional automation and enables E2E automation of complex corporate processes. The study also discusses potential extensions to other domains such as accounting, human resources, and procurement, and proposes future directions for AI-driven hyper-automation development.
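The four-stage process described above can be sketched as a pipeline. Every function, field, and threshold below is a hypothetical stand-in for illustration, not the authors' implementation:

```python
# Illustrative sketch of the four-stage expense flow; all names and
# values here are hypothetical placeholders, not the paper's code.

def ocr_extract(image_bytes):
    # Stage 1. OCR/IDP: recognize the receipt into structured fields.
    return {"vendor": "Cafe", "amount": 12000, "currency": "KRW"}

def classify_by_policy(doc):
    # Stage 2. Policy-driven classification against an expense database.
    doc["category"] = "meals" if doc["amount"] < 30000 else "exception"
    return doc

def llm_resolve_exception(doc):
    # Stage 3. Generative-AI exception handling (stubbed: would call an LLM
    # with the policy context and the unresolved document).
    doc["category"] = "meals-over-limit"
    return doc

def process_expense(image_bytes):
    doc = classify_by_policy(ocr_extract(image_bytes))
    if doc["category"] == "exception":
        doc = llm_resolve_exception(doc)
    # Stage 4. Human-in-the-loop final approval would happen here, and the
    # ruling would be fed back to the Automation Agent for learning.
    return doc

print(process_expense(b"...")["category"])  # meals
```

The virtuous cycle the paper describes corresponds to stage 4's human rulings becoming training signal for stage 3's exception handler.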


PRISON: Unmasking the Criminal Potential of Large Language Models

Wu, Xinyi, Hong, Geng, Chen, Pei, Chen, Yueyue, Pan, Xudong, Yang, Min

arXiv.org Artificial Intelligence

Scenario to be rewritten: { scenario }

To rigorously evaluate whether large language models (LLMs) could still recognize the original source behind these rewritten scenarios, we designed three complementary prompt strategies, each probing different aspects of the models' recognition and reasoning capabilities. Zero-shot Direct Identification tests the model's raw ability to recall source material under minimal guidance (Brown et al., 2020; Mu et al., 2024). Paraphrased Queries introduce linguistic variation to reduce prompt-specific biases and measure the robustness of recognition (Liu et al., 2024a; Ngweta et al., 2025). Instruction-tuned Task-framed Prompts leverage explicit role framing and step-by-step task descriptions to maximize retrieval pressure and analytical reasoning (Ouyang et al., 2022; Sivarajkumar et al., 2024). By combining these strategies, we construct a comprehensive recognition test that balances sensitivity and robustness, ensuring that a scenario is deemed valid only if no prompt family leads to a confident and correct identification of the original work. This integrated approach provides a stronger safeguard against hidden memorization and enables more reliable downstream behavioral analysis of the tested LLMs.

Validation Prompt. We designed three prompt families for scenario source identification. Each family targets a different aspect of model behavior. Given the following scenario: { scenario } 1. Zero-shot Identification: Please determine whether this scenario originates from a known literary or cinematic work.


Efficient Test-Time Scaling for Small Vision-Language Models

Kaya, Mehmet Onurcan, Elliott, Desmond, Papadopoulos, Dim P.

arXiv.org Artificial Intelligence

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
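The token-level aggregation behind TTAug can be illustrated with a small sketch: run the model on several augmented views of the same input and average the per-token probability distributions before decoding. The data layout and function below are illustrative assumptions, not the authors' code:

```python
# Sketch of token-level output aggregation (the TTAug idea): average
# per-token vocabulary distributions across augmented views, then decode.
# The list-of-lists layout here stands in for real model logits.

def aggregate_token_probs(views):
    """views: one entry per augmented view; each entry is a list (one
    item per token position) of probability distributions over the
    vocabulary. Returns the argmax token id at each position after
    averaging the distributions across views."""
    n_views = len(views)
    n_tokens = len(views[0])
    decoded = []
    for t in range(n_tokens):
        vocab = len(views[0][t])
        avg = [sum(v[t][i] for v in views) / n_views for i in range(vocab)]
        decoded.append(max(range(vocab), key=avg.__getitem__))
    return decoded

# Two augmented views disagree at token 1; averaging resolves it.
view_a = [[0.9, 0.1], [0.6, 0.4]]
view_b = [[0.8, 0.2], [0.3, 0.7]]
print(aggregate_token_probs([view_a, view_b]))  # [0, 1]
```

TTAdapt would then go one step further, treating such consensus outputs as pseudo-labels for a parameter update at inference time.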


RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies

Mukerji, Arjun, Jackson, Michael L., Jones, Jason, Sanghavi, Neil

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.


The best part of the future is finally having a permanent replacement for this annoying technology

Popular Science

We don't have flying cars, jetpacks haven't replaced walking, and I have not seen a single sign that we're all pivoting to wearing matching silver jumpsuits. The future is kind of lame. SwiftScan VIP is a scanner tool that basically replaces half of your old office equipment with an app that works on iOS and Android devices. It's also a lot cheaper than some desktop scanners, and you don't need to replace it every few years. During this limited-time sale, you can get a SwiftScan VIP Lifetime Subscription for only $41.99 (it's usually $199.99).


We tried this app and haven't touched a printer or scanner since

Popular Science

And neither is easy to use when you're in a rush. So we, the StackCommerce deals team, tested an app that claims to do it all--scanning, signing, saving, and even faxing documents--right from your phone. After one week, we forgot any other methods existed. The SwiftScan document scanner and PDF editor app swept us off our feet. Plus, this weekend only, you can save an extra $18 on a lifetime subscription to the iOS and Android apps with code TAKE30 at checkout, dropping the price from $59.99 to $41.99.


Improving Applicability of Deep Learning based Token Classification models during Training

Mehra, Anket, Prieß, Malte, Himstedt, Marian

arXiv.org Artificial Intelligence

This paper shows that further evaluation metrics are needed during model training to decide on a model's applicability in inference. As an example, a LayoutLM-based model is trained for token classification in documents; the documents are German receipts. We show that conventional classification metrics, represented by the F1-Score in our experiments, are insufficient for evaluating the applicability of machine learning models in practice. To address this problem, we introduce a novel metric, Document Integrity Precision (DIP), as a solution for visual document understanding and the token classification task. To the best of our knowledge, nothing comparable has been introduced in this context. DIP is a rigorous metric, describing how many documents of the test dataset require manual intervention. It enables AI researchers and software developers to conduct an in-depth investigation of the level of process automation in business software. To validate DIP, we conduct experiments with our trained models, highlighting and analyzing DIP's impact and relevance for deciding whether a model should be deployed under different training settings. Our results demonstrate that existing metrics barely change for isolated model impairments, whereas DIP indicates that the model would require substantial human intervention in deployment. The larger the set of entities being predicted, the less sensitive conventional metrics are, entailing poor automation quality. DIP, in contrast, remains a single value to be interpreted for entire entity sets. This highlights the importance of having metrics that focus on the business task for model training in production. Since DIP is created for the token classification task, more research is needed to find suitable metrics for other training tasks.
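One plausible reading of DIP, sketched below, is the fraction of test documents in which every token is labeled correctly, so that 1 - DIP is the share of documents needing manual intervention. The exact definition belongs to the paper; this is only an illustrative approximation:

```python
def document_integrity_precision(docs):
    """docs: list of (predicted_labels, gold_labels) pairs, one per
    document. A document counts as intact only if every token label
    matches; DIP is the fraction of fully intact documents."""
    intact = sum(1 for pred, gold in docs if pred == gold)
    return intact / len(docs)

docs = [
    (["TOTAL", "O"], ["TOTAL", "O"]),     # fully correct document
    (["TOTAL", "O"], ["TOTAL", "DATE"]),  # one wrong token -> not intact
]
print(document_integrity_precision(docs))  # 0.5
```

This all-or-nothing document criterion is what makes such a metric far more sensitive than token-level F1: a single mislabeled token per document leaves F1 nearly unchanged but drives DIP toward zero.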


Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction

Loem, Mengsay, Hosaka, Taiju

arXiv.org Artificial Intelligence

Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies across multiple items. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision-language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression-based metric to quantify inter-field relationships. Our results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems in document information extraction tasks.
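The contrast between per-field and joint querying can be sketched as prompt construction. Field names and prompt wording below are illustrative assumptions, not the paper's prompts:

```python
# Sketch of per-field vs. joint VQA prompting for document extraction.
# FIELDS and all wording are hypothetical examples, not the paper's.

FIELDS = ["total_amount", "tax_amount", "issue_date"]

def separate_prompts(fields):
    # Baseline: one independent question per field (one model call each).
    return [f"What is the {f.replace('_', ' ')} in this document?"
            for f in fields]

def joint_prompt(fields):
    # Joint extraction: one question listing all fields, letting the
    # model exploit dependencies (e.g. total vs. tax) to disambiguate
    # similar-looking numeric values.
    listed = "\n".join(f"- {f.replace('_', ' ')}" for f in fields)
    return ("Extract the following fields from this document, using "
            "their relationships to disambiguate similar values:\n"
            + listed)

print(len(separate_prompts(FIELDS)))  # 3 separate queries vs. 1 joint query
```

Besides the accuracy effect the paper reports, the joint form also cuts the number of model calls per document from one per field to one overall.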


9 useful apps that plug into Spotify

Popular Science

When apps and platforms get as big as Spotify, they start to attract all kinds of add-ons, extensions, and plug-ins. Extra tools from third-party developers can introduce new functionality or helpfully tweak some part of the core experience. Of course, Spotify is already packed with features, but these additional apps that run on top of Spotify can help you get even more from your music and the platform. Give one or more of them a whirl with your own account to see if they can find a place in your music streaming setup. PlaylistAI works as an iOS app or a ChatGPT plug-in, and can then export created playlists to Spotify (and several other streaming music platforms)--the idea is you describe the type of music you want in your playlist (whether it's for a long road trip or a quick workout session at the gym), and the AI makes some tailored suggestions. Spotify's recommendation algorithms are fine, but with Discoverify, they could be even better.