relevant region
DocVXQA: Context-Aware Visual Explanations for Document Question Answering
Souibgui, Mohamed Ali, Choi, Changkyu, Barsky, Andrey, Jung, Kangsoo, Valveny, Ernest, Karatzas, Dimosthenis
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are contextually sufficient while remaining representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Norway (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Overview (0.93)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.62)
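The DocVXQA abstract above formulates explainability principles as explicit learning objectives. A minimal sketch of how such objectives might be combined into one training loss is shown below; the loss terms, weights (`lambda_suff`, `lambda_sparse`), and toy numbers are invented for illustration and are not the paper's actual formulation.

```python
import math

def answer_loss(pred_prob):
    # Toy negative log-likelihood of the correct answer's predicted probability.
    return -math.log(max(pred_prob, 1e-9))

def sufficiency_loss(masked_pred, full_pred):
    # "Contextually sufficient": the answer predicted from only the highlighted
    # regions should match the answer predicted from the full document.
    return abs(full_pred - masked_pred)

def sparsity_loss(heatmap):
    # "Representation-efficient": prefer small, focused heatmaps
    # (here, a simple mean-activation penalty).
    return sum(heatmap) / len(heatmap)

def total_loss(full_pred, masked_pred, heatmap,
               lambda_suff=1.0, lambda_sparse=0.1):
    # Accuracy and explanation quality traded off in a single objective.
    return (answer_loss(full_pred)
            + lambda_suff * sufficiency_loss(masked_pred, full_pred)
            + lambda_sparse * sparsity_loss(heatmap))

loss = total_loss(full_pred=0.9, masked_pred=0.85, heatmap=[0.0, 0.8, 0.1, 0.0])
```

Note that a denser heatmap raises the loss even when predictions are unchanged, which is the intuition behind the representation-efficiency term.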
Review for NeurIPS paper: Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets
This paper still considers only the resolution, depth, and width dimensions, which have already been studied in EfficientNet. Although the discovery that "resolution and depth are more important than width for tiny networks" differs from the conclusion in EfficientNet, I feel this point is not significant enough; it reads as a supplement to EfficientNet. I am not saying that this kind of method is bad, but the insights and intuitions for why resolution and depth matter more than width for small networks (derived this way) remain unclear. In my opinion, this paper is essentially doing random search by shrinking the EfficientNet-B0 structure configurations along the three mentioned dimensions. I believe the derived observation is useful, but the method itself offers very limited value to the community. Even a simple search method such as evolutionary search could achieve a similar or identical result more efficiently.
Relevant Region Sampling Strategy with Adaptive Heuristic for Asymptotically Optimal Path Planning
Li, Chenming, Meng, Fei, Ma, Han, Wang, Jiankun, Meng, Max Q. -H.
Sampling-based planning algorithms are a powerful tool for solving planning problems in high-dimensional state spaces. In this article, we present a novel approach to sampling in the most promising regions, which significantly reduces planning time. The RRT# algorithm defines the Relevant Region based on the cost-to-come provided by the optimal forward-searching tree. However, it uses the cumulative cost of a direct connection between the current state and the goal state as the cost-to-go. To improve path planning efficiency, we propose a batch sampling method that samples in a refined Relevant Region with a direct sampling strategy, defined according to the optimal cost-to-come and an adaptive cost-to-go that takes advantage of various sources of heuristic information. The proposed sampling approach allows the algorithm to build the search tree in the direction of the most promising area, resulting in superior initial solution quality and reduced overall computation time compared to related work. To validate the effectiveness of our method, we conducted several simulations in both $SE(2)$ and $SE(3)$ state spaces; the simulation results demonstrate the superiority of the proposed algorithm.
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Asia > China > Hong Kong (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.48)
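The Relevant Region idea above can be sketched as rejection sampling: a candidate state is kept only if an optimistic estimate of the total path cost through it can beat the current best solution. The names `g_hat`, `h_hat`, and `c_best` below are generic A*-style notation and the Euclidean heuristics are a simplification, not the paper's exact definitions.

```python
import math
import random

def in_relevant_region(x, start, goal, c_best):
    g_hat = math.dist(start, x)  # optimistic cost-to-come from the start
    h_hat = math.dist(x, goal)   # optimistic cost-to-go toward the goal
    # Keep x only if a path through it could improve on the best cost found so far.
    return g_hat + h_hat < c_best

def sample_relevant(start, goal, c_best, bounds, rng, max_tries=1000):
    # Uniform rejection sampling restricted to the relevant region.
    for _ in range(max_tries):
        x = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        if in_relevant_region(x, start, goal, c_best):
            return x
    return None  # region may be empty or extremely small

rng = random.Random(0)
s, g = (0.0, 0.0), (1.0, 0.0)
x = sample_relevant(s, g, c_best=1.2, bounds=[(0.0, 1.0), (0.0, 1.0)], rng=rng)
```

With a Euclidean heuristic this region is the familiar informed-sampling ellipse with foci at the start and goal, which is why shrinking `c_best` as better solutions are found concentrates samples near the promising area.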
Picture Perfect - Hackster.io
As machine learning algorithms continue to advance, the need for good, accurately annotated datasets is becoming increasingly apparent. With less and less room for optimization of the models themselves, more attention is finally being turned to addressing issues with data quality. After all, no matter how much potential a particular model has, that potential cannot be realized without a good dataset to learn from. Image classification is a common task for machine learning models, and these models suffer from a particular type of data problem called co-occurrence bias. Co-occurrence bias can cause irrelevant details to get the attention of a machine learning model, leading to incorrect predictions. For example, if a dataset used to train an object recognition model only contains images of boats in the ocean, the model may start classifying anything related to the ocean, such as beaches or waves, as boats.
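The boats-in-the-ocean example above can be checked for directly in an annotated dataset: count how often each object label co-occurs with each scene or context tag, and flag labels dominated by a single context. The toy annotations and the `context_skew` helper below are invented for illustration.

```python
from collections import Counter

# Toy dataset annotations: each record pairs an object label with a scene tag.
annotations = [
    {"label": "boat", "context": "ocean"},
    {"label": "boat", "context": "ocean"},
    {"label": "boat", "context": "ocean"},
    {"label": "boat", "context": "harbor"},
    {"label": "car",  "context": "street"},
    {"label": "car",  "context": "ocean"},
]

def context_skew(annotations, label):
    """Fraction of a label's examples that share its single most common context.

    A value near 1.0 suggests the label may be learned from its background
    rather than from the object itself (co-occurrence bias).
    """
    contexts = Counter(a["context"] for a in annotations if a["label"] == label)
    total = sum(contexts.values())
    return contexts.most_common(1)[0][1] / total

skew = context_skew(annotations, "boat")  # 3 of 4 boat images are "ocean"
```

A high skew does not prove the model is biased, but it identifies label/context pairs worth auditing or rebalancing before training.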
Finding Short Signals in Long Irregular Time Series with Continuous-Time Attention Policy Networks
Hartvigsen, Thomas, Thadajarassiri, Jidapa, Kong, Xiangnan, Rundensteiner, Elke
Irregularly-sampled time series (ITS) are native to high-impact domains like healthcare, where measurements are collected over time at uneven intervals. However, for many classification problems, only small portions of long time series are relevant to the class label. In this case, existing ITS models often fail to classify long series, since they rely on careful imputation, which easily over- or under-samples the relevant regions. Building on this insight, we propose CAT, a model that classifies multivariate ITS by explicitly seeking highly-relevant portions of an input series' timeline. CAT achieves this by integrating three components: (1) A Moment Network learns to seek relevant moments in an ITS's continuous timeline using reinforcement learning. (2) A Receptor Network models the temporal dynamics of both observations and their timing localized around predicted moments. (3) A recurrent Transition Model models the sequence of transitions between these moments, cultivating a representation with which the series is classified. Using synthetic and real data, we find that CAT outperforms ten state-of-the-art methods by finding short signals in long irregular time series.
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
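The "localized around predicted moments" idea behind CAT's Receptor Network can be illustrated with a tiny sketch: given a predicted moment on the continuous timeline, gather only the (timestamp, value) observations inside a window around it, instead of imputing the whole irregular series onto a grid. The window width and the toy series below are illustrative, not from the paper.

```python
def receptor(series, moment, half_width):
    """Return the observations whose timestamps fall within +/- half_width
    of the predicted moment on the continuous timeline."""
    return [(t, v) for t, v in series if abs(t - moment) <= half_width]

# An irregularly sampled series: (timestamp, value) pairs at uneven intervals,
# with a short high-valued signal around t ~ 3.5 buried in a long series.
its = [(0.1, 5.0), (0.7, 5.2), (3.4, 9.8), (3.6, 10.1), (8.9, 5.1)]

window = receptor(its, moment=3.5, half_width=0.5)
```

Because the receptor operates on raw timestamps, no resampling or imputation of the gaps is needed, which is exactly where grid-based ITS models tend to over- or under-sample.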
Explaining machine learning models for age classification in human gait analysis
Slijepcevic, Djordje, Horst, Fabian, Simak, Marvin, Lapuschkin, Sebastian, Raberger, Anna-Maria, Samek, Wojciech, Breiteneder, Christian, Schöllhorn, Wolfgang I., Zeppelzauer, Matthias, Horsak, Brian
Machine learning (ML) models have proven effective in classifying gait analysis data, e.g., binary classification of young vs. older adults. ML models, however, often fail to provide human-understandable explanations for their predictions. This "black-box" behavior impedes understanding of which input features the model predictions are based on. We investigated an Explainable Artificial Intelligence method, i.e., Layer-wise Relevance Propagation (LRP), for gait analysis data. The research question was: Which input features are used by ML models to classify age-related differences in walking patterns? We utilized a subset of the AIST Gait Database 2019 containing five bilateral ground reaction force (GRF) recordings per person during barefoot walking of healthy participants. Each input signal was min-max normalized before concatenation and fed into a Convolutional Neural Network (CNN). Participants were divided into three age groups: young (20-39 years), middle-aged (40-64 years), and older (65-79 years) adults. The classification accuracy and relevance scores (derived using LRP) were averaged over a stratified ten-fold cross-validation. The mean classification accuracy of 60.1% was clearly higher than the zero-rule baseline of 37.3%. The confusion matrix shows that the CNN distinguished younger and older adults well, but had difficulty modeling the middle-aged adults.
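Two small preprocessing and evaluation pieces mentioned in the abstract can be sketched with toy numbers: min-max normalization of an input signal, and the zero-rule baseline (always predict the majority class) against which the reported accuracy is compared. The sample values below are invented, not from the AIST Gait Database 2019.

```python
from collections import Counter

def min_max_normalize(signal):
    """Rescale a signal linearly into [0, 1] using its own min and max."""
    lo, hi = min(signal), max(signal)
    return [(x - lo) / (hi - lo) for x in signal]

def zero_rule_accuracy(labels):
    """Accuracy of the zero-rule classifier: always predict the most
    frequent class in the data."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

grf = [120.0, 480.0, 300.0, 120.0]   # toy ground reaction force samples
norm = min_max_normalize(grf)         # each value rescaled into [0, 1]

# With three roughly balanced age groups, the zero-rule baseline sits near 1/3,
# which matches the order of magnitude of the 37.3% baseline in the study.
baseline = zero_rule_accuracy(["young", "young", "middle", "older"])
```

This is why a 60.1% three-class accuracy is meaningful despite sounding low: the relevant comparison is the majority-class rate, not 50%.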
Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness
Ray, Arijit, Cogswell, Michael, Lin, Xiao, Alipour, Kamran, Divakaran, Ajay, Yao, Yi, Burachas, Giedrius
Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we find that users are often misled by current attention map visualizations, which point to relevant regions even when the model produces an incorrect answer. Hence, we propose Error Maps that clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly, leading to an incorrect answer, and hence improve users' understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users' interpretation of explanations to evaluate their potential helpfulness for understanding model correctness. Finally, we conduct user studies showing that our new explanations help users understand model correctness better than baselines by an expected 30%, and that our proxy helpfulness metrics correlate strongly ($\rho > 0.97$) with how well users can predict model correctness.
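One simple way to think about an error map, sketched below, is as a per-region error rate: aggregate, over many examples, how often the model answers incorrectly when a given image region receives high attention. The grid size, attention threshold, and records here are hypothetical and are not the paper's actual construction.

```python
def error_map(records, n_regions, attn_threshold=0.5):
    """records: list of (attention_per_region, model_was_correct) pairs.

    Returns, per region, the fraction of high-attention cases in which the
    model's answer was wrong -- i.e., where attending is not enough.
    """
    errors = [0] * n_regions
    attended = [0] * n_regions
    for attn, correct in records:
        for i, a in enumerate(attn):
            if a >= attn_threshold:
                attended[i] += 1
                if not correct:
                    errors[i] += 1
    # Regions never attended get 0.0 rather than a division by zero.
    return [e / c if c else 0.0 for e, c in zip(errors, attended)]

# Three toy examples over a 3-region image grid.
records = [
    ([0.9, 0.1, 0.8], True),    # attended regions 0 and 2; answer correct
    ([0.7, 0.2, 0.9], False),   # attended regions 0 and 2; answer wrong
    ([0.1, 0.6, 0.9], False),   # attended regions 1 and 2; answer wrong
]
emap = error_map(records, n_regions=3)
```

The contrast with a plain attention map is the point: region 2 is attended in every example, but its high error rate warns the user that attention there does not imply a correct answer.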
A negative case analysis of visual grounding methods for VQA
Shrestha, Robik, Kafle, Kushal, Kanan, Christopher
Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.64)
- Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.50)
Collaborative Autonomy through Analogical Comic Graphs
Klenk, Matthew Evans (Palo Alto Research Center) | Mohan, Shiwali (Palo Alto Research Center) | Kleer, Johan de (Palo Alto Research Center) | Bobrow, Daniel G. (Palo Alto Research Center) | Hinrichs, Tom (Northwestern University) | Forbus, Ken (Northwestern University)
For more effective collaboration, users and autonomous systems should interact naturally. We propose that sketch-based interaction coupled with qualitative representations and analogy provides a natural interface for users and systems. We introduce comic graphs that capture tasks in terms of the temporal dynamics of the spatial configurations of relevant objects. This paper demonstrates, through a strategy simulation example, how these models could be learned by demonstration, transferred to new situations, and enable explanations.
- North America > United States > California > Santa Clara County > Palo Alto (0.05)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Illinois > Cook County > Evanston (0.04)