AITopics

Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still heavily rely on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic workloads and limit their time in answering. For this purpose, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most of the existing VQA methods require an object detector and regions based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (1) surgical object detection model is scarce due to smaller datasets and lack of bounding box annotation; (2) current fusion strategy of heterogeneous modalities like text and image is naive; (3) the localized answering is missing, which is crucial in complex surgical scenarios. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during the answer prediction. To deal with the fusion of the heterogeneous modalities, we design gated vision-language embedding (GVLE) to build input patches for the Language Vision Transformer (LViT) to predict the answer. To get localization, we add the detection head in parallel with the prediction head of the LViT. We also integrate GIoU loss to boost localization performance by preserving the accuracy of the question-answering model. We annotate two datasets of VQLA by utilizing publicly available surgical videos from MICCAI challenges EndoVis-17 and 18. Our validation results suggest that Surgical-VQLA can better understand the surgical scene and localize the specific area related to the question-answering. GVLE presents an efficient language-vision embedding technique by showing superior performance over the existing benchmarks.

machine learning, natural language, question answering, (16 more...)

2305.11692

Country:

Asia > China > Hong Kong (0.05)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
(2 more...)

Genre: Research Report (0.70)

Industry:

Health & Medicine > Surgery (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

S$^3$HQA: A Three-Stage Approach for Multi-hop Text-Table Hybrid Question Answering

Lei, Fangyu, Li, Xiang, Wei, Yifan, He, Shizhu, Huang, Yiming, Zhao, Jun, Liu, Kang

Answering multi-hop questions over hybrid factual knowledge from the given text and table (TextTableQA) is a challenging task. Existing models mainly adopt a retriever-reader framework, which have several deficiencies, such as noisy labeling in training retriever, insufficient utilization of heterogeneous information over text and table, and deficient ability for different reasoning operations. In this paper, we propose a three-stage TextTableQA framework S3HQA, which comprises of retriever, selector, and reasoner. We use a retriever with refinement training to solve the noisy labeling problem. Then, a hybrid selector considers the linked relationships between heterogeneous data to select the most relevant factual knowledge. For the final stage, instead of adapting a reading comprehension module like in previous methods, we employ a generation-based reasoner to obtain answers. This includes two approaches: a row-wise generator and an LLM prompting generator~(first time used in this task). The experimental results demonstrate that our method achieves competitive results in the few-shot setting. When trained on the full dataset, our approach outperforms all baseline methods, ranking first on the HybridQA leaderboard.

large language model, machine learning, question answering, (19 more...)

2305.11725

Country:

Asia > China (0.04)
North America > United States > Tennessee (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Zhang, Xuanyu, Yang, Qing

Self-QA: Unsupervised Knowledge Guided Language Model Alignment

Large-scale language models like ChatGPT and GPT-4 have gained attention for their impressive conversational and generative capabilities. However, the creation of supervised paired question-answering data for instruction tuning presents formidable challenges. This endeavor necessitates substantial human effort for data annotation and wrestles with issues concerning data quality, diversity, accuracy, and other related factors. To overcome these obstacles, we introduce an innovative framework named Self-QA, which replaces the traditional practice of human-written instruction seeds with a vast amount of unsupervised knowledge, enabling the model to generate a larger quantity of correct and domain-specific instruction data. The effectiveness of our proposed method is demonstrated through experiments conducted on unsupervised corpora from various domains.

large language model, machine learning, question answering, (20 more...)

2305.11952

Country:

Asia > China > Beijing > Beijing (0.05)
North America > United States > Washington > King County > Redmond (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.49)

Bongini, Pietro, Becattini, Federico, Del Bimbo, Alberto

Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?

The use of Deep Learning and Computer Vision in the Cultural Heritage domain is becoming highly relevant in the last few years with lots of applications about audio smart guides, interactive museums and augmented reality. All these technologies require lots of data to work effectively and be useful for the user. In the context of artworks, such data is annotated by experts in an expensive and time consuming process. In particular, for each artwork, an image of the artwork and a description sheet have to be collected in order to perform common tasks like Visual Question Answering. In this paper we propose a method for Visual Question Answering that allows to generate at runtime a description sheet that can be used for answering both visual and contextual questions about the artwork, avoiding completely the image and the annotation process. For this purpose, we investigate on the use of GPT-3 for generating descriptions for artworks analyzing the quality of generated descriptions through captioning metrics. Finally we evaluate the performance for Visual Question Answering and captioning tasks.

large language model, machine learning, question answering, (19 more...)

2207.12101

Country:

Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.05)
Europe > Italy (0.04)
Asia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceMay-18-2023

Learning to Generalize for Cross-domain QA

Niu, Yingjie, Yang, Linyi, Dong, Ruihai, Zhang, Yue

There have been growing concerns regarding the out-of-domain generalization ability of natural language processing (NLP) models, particularly in question-answering (QA) tasks. Current synthesized data augmentation methods for QA are hampered by increased training costs. To address this issue, we propose a novel approach that combines prompting methods and linear probing then fine-tuning strategy, which does not entail additional cost. Our method has been theoretically and empirically shown to be effective in enhancing the generalization ability of both generative and discriminative models. Our approach outperforms state-of-the-art baselines, with an average increase in F1 score of 4.5%-7.9%. Furthermore, our method can be easily integrated into any pre-trained models and offers a promising solution to the under-explored cross-domain QA task. We release our source code at GitHub*.

information retrieval, natural language, question answering, (21 more...)

2305.08208

Country: North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre:

Research Report > Promising Solution (0.54)
Research Report > New Finding (0.46)

Industry:

Consumer Products & Services > Restaurants (0.94)
Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.35)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

arXiv.org Artificial IntelligenceMay-17-2023

Motion Question Answering via Modular Motion Programs

Endo, Mark, Hsu, Joy, Li, Jiaman, Wu, Jiajun

In order to build artificial intelligence systems that can perceive and reason with human behavior in the real world, we must first design models that conduct complex spatio-temporal reasoning over motion sequences. Moving towards this goal, we propose the HumanMotionQA task to evaluate complex, multi-step reasoning abilities of models on long-form human motion sequences. We generate a dataset of question-answer pairs that require detecting motor cues in small portions of motion sequences, reasoning temporally about when events occur, and querying specific motion attributes. In addition, we propose NSPose, a neuro-symbolic method for this task that uses symbolic reasoning and a modular design to ground motion through learning motion concepts, attribute neural operators, and temporal relations. We demonstrate the suitability of NSPose for the HumanMotionQA task, outperforming all baseline methods.

motion sequence, nspose, proceedings, (14 more...)

2305.08953

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceMay-16-2023

xPQA: Cross-Lingual Product Question Answering across 12 Languages

Shen, Xiaoyu, Asai, Akari, Byrne, Bill, de Gispert, Adrià

Product Question Answering (PQA) systems are key in e-commerce applications to provide responses to customers' questions as they shop for products. While existing work on PQA focuses mainly on English, in practice there is need to support multiple customer languages while leveraging product information available in English. To study this practical industrial task, we present xPQA, a large-scale annotated cross-lingual PQA dataset in 12 languages across 9 branches, and report results in (1) candidate ranking, to select the best English candidate containing the information to answer a non-English question; and (2) answer generation, to generate a natural-sounding non-English answer based on the selected English candidate. We evaluate various approaches involving machine translation at runtime or offline, leveraging multilingual pre-trained LMs, and including or excluding xPQA training data. We find that (1) In-domain data is essential as cross-lingual rankers trained on other domains perform poorly on the PQA task; (2) Candidate ranking often prefers runtime-translation approaches while answer generation prefers multilingual approaches; (3) Translating offline to augment multilingual models helps candidate ranking mainly on languages with non-Latin scripts; and helps answer generation mainly on languages with Latin scripts. Still, there remains a significant performance gap between the English and the cross-lingual test sets.

information, natural language, question answering, (19 more...)

2305.09249

Country:

North America > United States (0.14)
Asia > India (0.05)
Europe > Italy (0.04)
(9 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.35)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)

arXiv.org Artificial IntelligenceMay-15-2023

Defending Against Misinformation Attacks in Open-Domain Question Answering

Weller, Orion, Khan, Aleem, Weir, Nathaniel, Lawrie, Dawn, Van Durme, Benjamin

Recent work in open-domain question answering (ODQA) has shown that adversarial poisoning of the search collection can cause large drops in accuracy for production systems. However, little to no work has proposed methods to defend against these attacks. To do so, we rely on the intuition that redundant information often exists in large corpora. To find it, we introduce a method that uses query augmentation to search for a diverse set of passages that could answer the original question but are less likely to have been poisoned. We integrate these new passages into the model through the design of a novel confidence method, comparing the predicted answer to its appearance in the retrieved contexts (what we call \textit{Confidence from Answer Redundancy}, i.e. CAR). Together these methods allow for a simple but effective way to defend against poisoning attacks that provides gains of nearly 20\% exact match across varying levels of data poisoning/knowledge conflicts.

information retrieval, machine learning, question answering, (19 more...)

2212.10002

Country:

Asia > Middle East > Republic of Türkiye (0.05)
Africa > Kenya (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Government (0.94)
Health & Medicine (0.93)
Media > News (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)

arXiv.org Artificial IntelligenceMay-15-2023

KEPR: Knowledge Enhancement and Plausibility Ranking for Generative Commonsense Question Answering

Li, Zhifeng, Zou, Bowei, Fan, Yifan, Hong, Yu

Generative commonsense question answering (GenCQA) is a task of automatically generating a list of answers given a question. The answer list is required to cover all reasonable answers. This presents the considerable challenges of producing diverse answers and ranking them properly. Incorporating a variety of closely-related background knowledge into the encoding of questions enables the generation of different answers. Meanwhile, learning to distinguish positive answers from negative ones potentially enhances the probabilistic estimation of plausibility, and accordingly, the plausibility-based ranking. Therefore, we propose a Knowledge Enhancement and Plausibility Ranking (KEPR) approach grounded on the Generate-Then-Rank pipeline architecture. Specifically, we expand questions in terms of Wiktionary commonsense knowledge of keywords, and reformulate them with normalized patterns. Dense passage retrieval is utilized for capturing relevant knowledge, and different PLM-based (BART, GPT2 and T5) networks are used for generating answers. On the other hand, we develop an ELECTRA-based answer ranking model, where logistic regression is conducted during training, with the aim of approximating different levels of plausibility in a polar classification scenario. Extensive experiments on the benchmark ProtoQA show that KEPR obtains substantial improvements, compared to the strong baselines. Within the experimental models, the T5-based GenCQA with KEPR obtains the best performance, which is up to 60.91% at the primary canonical metric Inc@3. It outperforms the existing GenCQA models on the current leaderboard of ProtoQA.

machine learning, natural language, question answering, (21 more...)

2305.08347

Country:

Asia > China (0.05)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.37)

arXiv.org Artificial IntelligenceMay-15-2023

Is a Video worth $n\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering

Lyu, Chenyang, Ji, Tianbo, Graham, Yvette, Foster, Jennifer

Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders followed by interaction between frames and question. However, such schema would incur significant memory use and inevitably slow down the training and inference speed. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models where we concatenate video frames to a $n\times n$ matrix and then convert it to one image. By doing so, we reduce the use of the image encoder from $n^{2}$ to $1$ while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance with nearly $4\times$ faster speed and only 30% memory use. We show that by integrating our approach into VideoQA systems we can achieve comparable, even superior, performance with a significant speed up for training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for those who have limited access to budgets and resources. Our code will be made publicly available for research use.

deep learning, efficient approach, machine learning, (5 more...)

2305.09107

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.60)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.60)