Joshi, Mandar
BAGEL: Bootstrapping Agents by Guiding Exploration with Language
Murty, Shikhar, Manning, Christopher, Shaw, Peter, Joshi, Mandar, Lee, Kenton
Following natural language instructions by executing actions in digital environments (e.g. web browsers and REST APIs) is a challenging task for language model (LM) agents. Unfortunately, LM agents often fail to generalize to new environments without human demonstrations. This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations via round-trips between two noisy LM components: an LM labeler, which converts a trajectory into a synthetic instruction, and a zero-shot LM agent, which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly shifts the initial distribution of trajectories towards those that are well described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of 2-13% absolute on ToolQA and MiniWob++, with up to a 13x reduction in execution failures.
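A minimal sketch of the round-trip loop described in this abstract is shown below. The callables `label_trajectory`, `run_zero_shot_agent`, and the seed set are hypothetical placeholders, not the paper's implementation; they stand in for the LM labeler and zero-shot LM agent.

```python
# Sketch of a BAGEL-style bootstrapping loop (assumed interfaces, not the paper's code).

def bagel_bootstrap(seed_trajectories, label_trajectory, run_zero_shot_agent,
                    num_rounds=3):
    """Iteratively convert raw trajectories into (instruction, trajectory) demos."""
    demonstrations = []
    trajectories = list(seed_trajectories)
    for _ in range(num_rounds):
        refined = []
        for traj in trajectories:
            # LM labeler: trajectory -> synthetic natural-language instruction.
            instruction = label_trajectory(traj)
            # Zero-shot LM agent: instruction -> refined trajectory.
            new_traj = run_zero_shot_agent(instruction)
            refined.append(new_traj)
            demonstrations.append((instruction, new_traj))
        # The next round starts from the refined trajectories.
        trajectories = refined
    return demonstrations
```

At test time, per the abstract, demonstrations relevant to a new instruction would be retrieved (e.g. by instruction similarity) and placed in the agent's prompt for in-context learning.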
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Shaw, Peter, Joshi, Mandar, Cohan, James, Berant, Jonathan, Pasupat, Panupong, Hu, Hexiang, Khandelwal, Urvashi, Lee, Kenton, Toutanova, Kristina
Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use -- via pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction-following tasks.
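To make the "generic keyboard and mouse action space" concrete, here is an illustrative sketch of such an action space and a simple decoder from model text to actions. The action names and string format are assumptions for illustration; the paper's exact action set may differ.

```python
# Illustrative pixel-level action space (assumed, not the paper's exact definition).
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:
    x: int  # pixel coordinates on the screenshot
    y: int

@dataclass
class KeyPress:
    key: str  # e.g. "Tab", "Enter"

@dataclass
class TypeText:
    text: str

Action = Union[Click, KeyPress, TypeText]

def decode_action(model_output: str) -> Action:
    """Parse a model string like 'click 32 140', 'key Enter', or 'type hello world'."""
    head, _, rest = model_output.partition(" ")
    if head == "click":
        x, y = rest.split()
        return Click(x=int(x), y=int(y))
    if head == "key":
        return KeyPress(key=rest)
    return TypeText(text=rest)
```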
Efficient End-to-End Visual Document Understanding with Rationale Distillation
Zhu, Wang, Agarwal, Alekh, Joshi, Mandar, Jia, Robin, Thomason, Jesse, Toutanova, Kristina
Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate "rationales" on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.
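A minimal sketch of the kind of distillation target this abstract suggests, where the student emits an intermediate rationale before the answer. The marker tokens and field layout here are assumptions, not the paper's actual format.

```python
# Assumed rationale-then-answer target format for a small image-to-text student.

def build_student_target(rationale: str, answer: str) -> str:
    """Concatenate rationale and answer into one text target for an
    image-to-text student such as Pix2Struct."""
    return f"<rationale> {rationale} <answer> {answer}"

def split_student_prediction(prediction: str) -> tuple[str, str]:
    """Recover (rationale, answer) from a decoded student output."""
    rationale, _, answer = prediction.partition("<answer>")
    return rationale.replace("<rationale>", "").strip(), answer.strip()

# Example:
target = build_student_target("Total revenue: $4.2M", "$4.2M")
assert split_student_prediction(target) == ("Total revenue: $4.2M", "$4.2M")
```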
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Lee, Kenton, Joshi, Mandar, Turc, Iulia, Hu, Hexiang, Liu, Fangyu, Eisenschlos, Julian, Khandelwal, Urvashi, Shaw, Peter, Chang, Ming-Wei, Toutanova, Kristina
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
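The "render the question on top of the image" idea can be sketched with Pillow as below. The banner layout, font, and sizes are assumptions for illustration, not the exact preprocessing used by Pix2Struct.

```python
# Assumed preprocessing sketch: draw the question in a banner above the screenshot.
from PIL import Image, ImageDraw, ImageFont

def render_question_on_image(screenshot: Image.Image, question: str,
                             header_height: int = 40) -> Image.Image:
    """Return a new image with the question drawn in a banner above the input."""
    width, height = screenshot.size
    canvas = Image.new("RGB", (width, height + header_height), "white")
    canvas.paste(screenshot, (0, header_height))
    draw = ImageDraw.Draw(canvas)
    draw.text((4, 4), question, fill="black", font=ImageFont.load_default())
    return canvas

# Example:
# img = Image.open("screenshot.png")
# model_input = render_question_on_image(img, "What is the button label?")
```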
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, Soricut, Radu
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
Liu, Fangyu, Piccinno, Francesco, Krichene, Syrine, Pang, Chenxi, Lee, Kenton, Joshi, Mandar, Altun, Yasemin, Collier, Nigel, Eisenschlos, Julian Martin
Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
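One way a chart-derendering pretraining pair could be produced is sketched below: render synthetic data with matplotlib and use a linearized table as the text target. This mirrors the idea in the abstract; MatCha's actual data pipeline and table serialization may differ.

```python
# Assumed data-generation sketch for chart derendering (image -> table text).
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def make_derendering_example(labels, values, path="chart.png"):
    """Render a bar chart and return (image_path, linearized_table_target)."""
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title("Synthetic chart")
    fig.savefig(path)
    plt.close(fig)
    rows = " | ".join(f"{l}: {v}" for l, v in zip(labels, values))
    return path, f"title: Synthetic chart | {rows}"

# Example:
# make_derendering_example(["A", "B", "C"], [3, 7, 5])
```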
DePlot: One-shot visual language reasoning by plot-to-table translation
Liu, Fangyu, Eisenschlos, Julian Martin, Piccinno, Francesco, Krichene, Syrine, Pang, Chenxi, Lee, Kenton, Joshi, Mandar, Chen, Wenhu, Collier, Nigel, Altun, Yasemin
Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.
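A minimal sketch of the two-step, plug-and-play pipeline described above. `deplot_model` and `llm` are hypothetical callables standing in for the plot-to-table model and a pretrained LLM; the prompt wording is an assumption.

```python
# Assumed plug-and-play pipeline: plot -> linearized table -> LLM reasoning.

def answer_chart_question(chart_image, question, deplot_model, llm):
    """Translate the chart to a table, then let the LLM reason over the table."""
    linearized_table = deplot_model(chart_image)   # step 1: plot-to-text translation
    prompt = (
        "Read the table below and answer the question.\n"
        f"Table:\n{linearized_table}\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)                             # step 2: reasoning over the text
```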
Improving Passage Retrieval with Zero-Shot Question Generation
Sachan, Devendra Singh, Lewis, Mike, Joshi, Mandar, Aghajanyan, Armen, Yih, Wen-tau, Pineau, Joelle, Zettlemoyer, Luke
Queries and documents are typically embedded in a shared representation space to enable efficient search, before using a task-specific model to perform a deeper, token-level document analysis (e.g. a document reader that selects an answer span). We show that adding a zero-shot re-ranker to the retrieval stage of such models leads to large gains in performance, by doing deep token-level analysis with no task-specific data or tuning.
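An illustrative sketch of a zero-shot re-ranker in the spirit of this paper: retrieved passages are re-scored by how likely a pretrained language model finds the question given each passage. `question_log_likelihood` is a hypothetical scoring function (e.g. backed by an off-the-shelf PLM), not the paper's code.

```python
# Assumed re-ranking sketch: sort passages by log P(question | passage) under a PLM.

def rerank(question, passages, question_log_likelihood, top_k=10):
    """Return the top_k passages ranked by question likelihood given the passage."""
    scored = [
        (question_log_likelihood(question=question, passage=p), p)
        for p in passages
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]
```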
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Hu, Hexiang, Luan, Yi, Chen, Yang, Khandelwal, Urvashi, Joshi, Mandar, Lee, Kenton, Toutanova, Kristina, Chang, Ming-Wei
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto a single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. We also find that existing pretrained models exhibit different strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
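The generative recognition setup this abstract describes can be sketched as: a vision-language model generates an entity name for an (image, query) pair, which is then resolved against the Wikipedia label space. `vlm_generate` and the name-to-id lookup below are placeholders, not the OVEN-Wiki tooling.

```python
# Assumed sketch of generative open-domain entity recognition.

def link_entity(image, text_query, vlm_generate, name_to_entity_id):
    """Return a Wikipedia entity id for the image/query pair, or None if unmatched."""
    predicted_name = vlm_generate(image=image, prompt=text_query).strip()
    # Resolve the generated name against the (six-million-entry) label space.
    return name_to_entity_id.get(predicted_name.lower())
```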
DESCGEN: A Distantly Supervised Dataset for Generating Abstractive Entity Descriptions
Shi, Weijia, Joshi, Mandar, Zettlemoyer, Luke
Short textual descriptions of entities provide summaries of their key attributes and have been shown to be useful sources of background knowledge for tasks such as entity linking and question answering. However, generating entity descriptions, especially for new and long-tail entities, can be challenging since relevant information is often scattered across multiple sources with varied content and style. We introduce DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description. DESCGEN consists of 37K entity descriptions from Wikipedia and Fandom, each paired with nine evidence documents on average. The documents were collected using a combination of entity linking and hyperlinks to the Wikipedia and Fandom entity pages, which together provide high-quality distant supervision. The resulting summaries are more abstractive than those found in existing datasets and provide a better proxy for the challenge of describing new and emerging entities. We also propose a two-stage extract-then-generate baseline and show that there exists a large gap (19.9% in ROUGE-L) between state-of-the-art models and human performance, suggesting that the data will support significant future work.
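A hedged sketch of a two-stage extract-then-generate pipeline like the baseline mentioned above: first select sentences that mention the entity across the evidence documents, then feed them to an abstractive generator. `generator` is a hypothetical summarization callable; the paper's baseline may differ in both stages.

```python
# Assumed extract-then-generate sketch for entity description generation.

def extract_then_generate(entity, documents, generator, max_sentences=20):
    """Extract entity-mentioning sentences, then generate a short description."""
    extracted = []
    for doc in documents:
        for sentence in doc.split(". "):
            if entity.lower() in sentence.lower():
                extracted.append(sentence.strip())
    context = ". ".join(extracted[:max_sentences])
    return generator(f"Write a short description of {entity}: {context}")
```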