AITopics | Park, Jae Sung

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Deitke, Matt, Clark, Christopher, Lee, Sangho, Tripathi, Rohun, Yang, Yue, Park, Jae Sung, Salehi, Mohammadreza, Muennighoff, Niklas, Lo, Kyle, Soldaini, Luca, Lu, Jiasen, Anderson, Taira, Bransom, Erin, Ehsani, Kiana, Ngo, Huong, Chen, YenSung, Patel, Ajay, Yatskar, Mark, Callison-Burch, Chris, Head, Andrew, Hendrix, Rose, Bastani, Favyen, VanderBilt, Eli, Lambert, Nathan, Chou, Yvonne, Chheda, Arnavi, Sparks, Jenna, Skjonsberg, Sam, Schmitz, Michael, Sarnat, Aaron, Bischoff, Byron, Walsh, Pete, Newell, Chris, Wolters, Piper, Gupta, Tanmay, Zeng, Kuo-Hao, Borchardt, Jon, Groeneveld, Dirk, Nam, Crystal, Lebrecht, Sophie, Wittlif, Caitlin, Schoenick, Carissa, Michel, Oscar, Krishna, Ranjay, Weihs, Luca, Smith, Noah A., Hajishirzi, Hannaneh, Girshick, Ross, Farhadi, Ali, Kembhavi, Aniruddha

arXiv.org Artificial Intelligence, Dec-5-2024

Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.

large language model, machine learning, natural language, (20 more...)

arXiv: 2409.17146

Country: North America > United States > Texas > Harris County > Houston (0.14)

Genre:

Research Report (0.63)
Questionnaire & Opinion Survey (0.46)

Industry:

Consumer Products & Services (0.67)
Media (0.67)
Leisure & Entertainment (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Awadalla, Anas, Xue, Le, Shu, Manli, Yan, An, Wang, Jun, Purushwalkam, Senthil, Shen, Sheng, Lee, Hannah, Lo, Oscar, Park, Jae Sung, Guha, Etash, Savarese, Silvio, Schmidt, Ludwig, Choi, Yejin, Xiong, Caiming, Xu, Ran

arXiv.org Artificial Intelligence, Nov-11-2024

Table 1: Comparison of open-source synthetic image-text datasets. We compare the datasets in terms of scale (number of samples), density (average number of words per sample), whether they are knowledge-augmented (meaning that the caption includes information found in the image's web-scraped alt-text), and the size of the captioning model used to generate the descriptions. For KALE, we create an initial pool of 100M captions from a 17B-parameter model and use it to distill a 2B-parameter model that matches the performance of the larger 17B model.

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that advances the state of knowledge-augmented image captioning. KALE builds upon recent work in this area, particularly CapsFusion [28], which pioneered the use of large language models to fuse synthetically generated captions with alt-text to incorporate real-world knowledge.

caption, large language model, machine learning, (19 more...)

arXiv: 2411.07461

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness

Chandu, Khyathi Raghavi, Li, Linjie, Awadalla, Anas, Lu, Ximing, Park, Jae Sung, Hessel, Jack, Wang, Lijuan, Choi, Yejin

arXiv.org Artificial Intelligence, Jul-2-2024

The ability to acknowledge the inevitable uncertainty in their knowledge and reasoning is a prerequisite for AI systems to be truly truthful and reliable. In this paper, we present a taxonomy of uncertainty specific to vision-language AI systems, distinguishing between epistemic uncertainty (arising from a lack of information) and aleatoric uncertainty (due to inherent unpredictability), and further explore finer categories within each. Based on this taxonomy, we synthesize a benchmark dataset, CertainlyUncertain, featuring 178K visual question answering (VQA) samples as contrastive pairs. This is achieved by 1) inpainting images to make previously answerable questions unanswerable; and 2) using image captions to prompt large language models for both answerable and unanswerable questions. Additionally, we introduce a new metric, confidence-weighted accuracy, that is well correlated with both accuracy and calibration error, to address the shortcomings of existing metrics.

accuracy, large language model, machine learning, (18 more...)

arXiv: 2407.01942

Country:

Europe (1.00)
North America > United States > Texas (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Israel (0.14)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Shen, Ethan, Fan, Alan, Pratt, Sarah M., Park, Jae Sung, Wallingford, Matthew, Kakade, Sham M., Holtzman, Ari, Krishna, Ranjay, Farhadi, Ali, Kusupati, Aditya

arXiv.org Artificial Intelligence, Jun-24-2024

Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To alleviate the computation cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the language model. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Code and more examples open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
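The decoding step described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model` is a hypothetical stand-in for a language model step, the function names are invented for illustration, and the n-gram interpolation used to filter incoherent drafts is omitted. It only shows the two ideas named in the abstract: feeding a weighted superposition of the $k$ drafts' latest token embeddings as a single input, and expanding $k$ drafts with the top-$k$ tokens into $k^2$ candidates before keeping the $k$ most likely.

```python
import math
from typing import List, Tuple

# A draft is (token ids, log-probability of the draft so far).
Draft = Tuple[List[int], float]


def superpose(embeddings: List[List[float]], drafts: List[Draft]) -> List[float]:
    """Weighted sum of each draft's last-token embedding.

    Weights are the draft probabilities, normalized to sum to 1.
    """
    weights = [math.exp(log_p) for _, log_p in drafts]
    total = sum(weights)
    dim = len(embeddings[0])
    mixed = [0.0] * dim
    for (tokens, _), w in zip(drafts, weights):
        vec = embeddings[tokens[-1]]
        for i in range(dim):
            mixed[i] += (w / total) * vec[i]
    return mixed


def toy_model(mixed: List[float]) -> List[float]:
    """Hypothetical stand-in for one LM step: log-softmax of the input.

    Assumes vocabulary size equals embedding dimension (toy setting only).
    """
    m = max(mixed)
    exps = [math.exp(x - m) for x in mixed]
    z = sum(exps)
    return [math.log(e / z) for e in exps]


def expand_and_prune(drafts: List[Draft], next_log_probs: List[float], k: int) -> List[Draft]:
    """Combine k drafts with the top-k next tokens (k^2 candidates); keep the k best."""
    top_tokens = sorted(
        range(len(next_log_probs)), key=lambda t: next_log_probs[t], reverse=True
    )[:k]
    candidates = [
        (tokens + [t], log_p + next_log_probs[t])
        for tokens, log_p in drafts
        for t in top_tokens
    ]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens with one-hot embeddings.
    embeddings = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    drafts = [([0], math.log(0.6)), ([1], math.log(0.4))]
    mixed = superpose(embeddings, drafts)          # one input for all drafts
    drafts = expand_and_prune(drafts, toy_model(mixed), k=2)
    print(drafts)
```

Because every draft shares the single distribution produced from the superposed input, one model call serves all $k$ drafts per step. In this sketch the candidates are ranked by that shared distribution alone; the abstract describes an additional n-gram interpolation for that purpose.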

large language model, machine learning, superposed decoding, (17 more...)

arXiv: 2405.184

Country:

North America > United States > California (0.14)
North America > Mexico > Veracruz (0.14)
Asia > India > Maharashtra (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area (0.69)
Education (0.69)
Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Agent AI: Surveying the Horizons of Multimodal Interaction

Durante, Zane, Huang, Qiuyuan, Wake, Naoki, Gong, Ran, Park, Jae Sung, Sarkar, Bidipta, Taori, Rohan, Noda, Yusuke, Terzopoulos, Demetri, Choi, Yejin, Ikeuchi, Katsushi, Vo, Hoi, Fei-Fei, Li, Gao, Jianfeng

arXiv.org Artificial Intelligence, Jan-25-2024

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

knowledge management, large language model, machine learning, (20 more...)

arXiv: 2401.03568

Country:

Asia > Japan (0.67)
North America > United States > California (0.46)

Genre:

Workflow (1.00)
Research Report > Promising Solution (1.00)

Industry:

Transportation (1.00)
Leisure & Entertainment > Games > Computer Games (1.00)
Law (1.00)
(7 more...)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(7 more...)

Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Park, Jae Sung, Hessel, Jack, Chandu, Khyathi Raghavi, Liang, Paul Pu, Lu, Ximing, West, Peter, Yu, Youngjae, Huang, Qiuyuan, Gao, Jianfeng, Farhadi, Ali, Choi, Yejin

arXiv.org Artificial Intelligence, Dec-12-2023

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.

answer, large language model, machine learning, (20 more...)

arXiv: 2312.04837

Country:

North America > United States (0.67)
Asia > Middle East > Israel (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.67)

Industry:

Leisure & Entertainment (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Hessel, Jack, Hwang, Jena D., Park, Jae Sung, Zellers, Rowan, Bhagavatula, Chandra, Rohrbach, Anna, Saenko, Kate, Choi, Yejin

arXiv.org Artificial Intelligence, Jul-25-2022

Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning? We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and leaderboard are available at http://visualabduction.com/

abductive reasoning, artificial intelligence, machine learning, (17 more...)

arXiv: 2202.048

Country: North America > United States (1.00)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment (0.93)
Transportation > Ground (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Abductive Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)