A Derivations of Variational Inference and ELBO; A.1 Derivation of the optimal q(·)
We expand Eq. 10 into the three KL divergence terms that appear in our training objective ELBO. On the Yelp Medium and Yelp Large datasets, we follow Guu et al. (2018) in using a three-layer attentional LSTM; skip connections are also used between adjacent LSTM layers. We apply the annealing and free-bits techniques of Li et al. (2019) to the KL term on the prototype variable. As in Section 4.3, here we show more examples generated through interpolation on the MSCOCO dataset. Table 6: Qualitative examples from the MSCOCO dataset on interpolated sentence generation given the prototype.
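The annealing and free-bits treatment of the KL term can be sketched as below. This is a minimal illustration of the general technique in the spirit of Li et al. (2019), with an assumed linear annealing schedule and assumed hyperparameter values, not the paper's exact configuration:

```python
import numpy as np

def kl_regularizer(kl_per_dim, step, total_anneal_steps=10_000, free_bits=0.5):
    """Anneal the KL weight from 0 to 1 and clamp each KL term from below
    ("free bits") so the posterior is not pushed all the way to the prior.

    kl_per_dim: array of per-dimension KL divergences, shape (batch, dim).
    """
    # Linear annealing schedule for the KL weight (an assumed schedule).
    beta = min(1.0, step / total_anneal_steps)
    # Free bits: each dimension contributes at least `free_bits` nats,
    # so the gradient vanishes once a dimension's KL drops below the floor.
    kl_clamped = np.maximum(kl_per_dim, free_bits)
    return beta * kl_clamped.sum(axis=-1).mean()
```

Early in training the loss ignores the KL term entirely (beta = 0), and dimensions whose KL is already below the floor contribute a constant, which is what prevents posterior collapse.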
Diverse Image Captioning with Context Object Split Latent Spaces
The word embedding dimension is 300. In Tab. 7 we further evaluate the diversity of COS-CVAE using self-CIDEr. We provide additional qualitative results in Tabs. In Tab. 12 we show the diverse captions for novel objects generated by our model and the regions. The evaluation server for nocaps accepts only one caption per image and does not support methods modeling one-to-many relationships between images and captions. In Figure 1 (left) we show the accuracy and diversity scores averaged across annotators; in Figure 1 (right) we show the accuracy and diversity scores from each annotator. We find that the captions generated by COS-CVAE are scored as more accurate compared to COS-CVAE (paired).
- North America > Canada > Alberta (0.05)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
Robotic underwater glider sets out to circumnavigate the globe
Redwing, a robotic submarine about the size of a surfboard, is embarking on a five-year journey that will follow the famed explorer Ferdinand Magellan's voyage around the world.

A small robot submarine is setting out to go around the world for the first time. Teledyne Marine and Rutgers University New Brunswick in New Jersey are launching an underwater glider called Redwing on its Sentinel Mission from Martha's Vineyard in Massachusetts on 11 October. Researchers have been using underwater gliders since the 1990s. Rather than a propeller, gliders have a buoyancy engine: a gas-filled piston that slightly changes the craft's overall buoyancy. An electric motor pushes the piston in to make the glider denser than water, so it slowly sinks, coasting downward at a shallow angle.
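The buoyancy-engine mechanism described above comes down to a sign flip in the net vertical force: move the piston, change the displaced volume, and the glider switches between sinking and rising. A minimal sketch, where every number (mass, piston stroke) is purely illustrative and not Redwing's actual specification:

```python
# Net vertical force on the glider: buoyancy minus weight.
RHO_SEAWATER = 1025.0  # kg/m^3, typical seawater density
G = 9.81               # m/s^2

def net_force(mass_kg, displaced_volume_m3):
    """Positive -> glider rises, negative -> glider sinks."""
    buoyancy = RHO_SEAWATER * G * displaced_volume_m3
    weight = mass_kg * G
    return buoyancy - weight

mass = 60.0                           # kg, assumed vehicle mass
neutral_volume = mass / RHO_SEAWATER  # displaced volume at neutral buoyancy
delta = 2.5e-4                        # assumed piston stroke: 0.25 L of volume

print(net_force(mass, neutral_volume + delta))  # piston out: positive, climbs
print(net_force(mass, neutral_volume - delta))  # piston in: negative, sinks
```

A fraction of a litre either way is enough: at these assumed numbers the net force is only a couple of newtons, which is why the glider coasts slowly rather than diving.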
- North America > United States > New Jersey (0.25)
- North America > United States > Massachusetts (0.25)
- South America > Falkland Islands (0.05)
- (9 more...)
- Transportation > Passenger (1.00)
- Transportation > Air (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.49)
- Health & Medicine > Therapeutic Area > Gastroenterology (0.30)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
Semantic information is vital for human interaction, and decoding it from brain activity enables non-invasive clinical augmentative and alternative communication. While there has been significant progress in reconstructing visual images, few studies have focused on the language aspect. To address this gap, leveraging the powerful capabilities of the decoder-based vision-language pretrained model CoCa, this paper proposes BrainChat, a simple yet effective generative framework aimed at rapidly accomplishing semantic information decoding tasks from brain activity, including fMRI question answering and fMRI captioning. BrainChat employs the self-supervised approach of Masked Brain Modeling to encode sparse fMRI data, obtaining a more compact embedding representation in the latent space. Subsequently, BrainChat bridges the gap between modalities by applying contrastive loss, resulting in aligned representations of fMRI, image, and text embeddings. Furthermore, the fMRI embeddings are mapped to the generative Brain Decoder via cross-attention layers, where they guide the generation of textual content about fMRI in an autoregressive manner by minimizing caption loss. Empirically, BrainChat exceeds the performance of existing state-of-the-art methods in the fMRI captioning task and, for the first time, implements fMRI question answering. Additionally, BrainChat is highly flexible and can achieve high performance without image data, making it better suited for real-world scenarios with limited data.
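The contrastive alignment step can be illustrated with a standard symmetric InfoNCE loss over paired fMRI and text embeddings. This is a generic sketch of that family of losses, not BrainChat's exact implementation; the array shapes and the temperature value are assumptions:

```python
import numpy as np

def info_nce(fmri_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.
    Row i of each matrix is assumed to be the matching pair."""
    # L2-normalize so the dot product is cosine similarity.
    f = fmri_emb / np.linalg.norm(fmri_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = f @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # matching pairs lie on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the fMRI->text and text->fMRI directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this pulls each fMRI embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is what "aligned representations" refers to above.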
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Zhejiang Province > Hangzhou (0.05)
- North America > Canada > Alberta (0.04)
- Africa > Togo (0.04)
- Health & Medicine > Health Care Technology (1.00)
- Leisure & Entertainment > Sports > Tennis (0.94)
- Health & Medicine > Therapeutic Area > Neurology (0.88)
Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection
Chen, Ruibo, Wu, Yihan, Chen, Lichang, Liu, Guodong, He, Qi, Xiong, Tianyi, Liu, Chenxi, Guo, Junfeng, Huang, Heng
Data selection in instruction tuning emerges as a pivotal process for acquiring high-quality data and training instruction-following large language models (LLMs), but it is still a new and unexplored research area for vision-language models (VLMs). Existing data selection approaches on LLMs either rely on single unreliable scores, or use downstream tasks for selection, which is time-consuming and can lead to potential over-fitting on the chosen evaluation datasets. To address this challenge, we introduce a novel dataset selection method, Self-Filter, that utilizes the VLM itself as a filter. This approach is inspired by the observation that VLMs benefit from training with the most challenging instructions. Self-Filter operates in two stages. In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM. In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity. Comprehensive experiments on LLaVA and MiniGPT-4 show that Self-Filter can reach better results compared to full data settings with merely about 15% samples, and can achieve superior performance against competitive baselines.
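The second-stage selection (take the hardest instructions, then penalize near-duplicates to encourage diversity) can be sketched as a greedy loop. The penalty form and all parameter values here are illustrative assumptions, not Self-Filter's exact procedure:

```python
import numpy as np

def select_samples(scores, embeddings, k, penalty=0.5):
    """Greedy selection: repeatedly take the most 'difficult' sample,
    then down-weight remaining candidates by their cosine similarity
    to the samples already chosen.

    scores: (n,) difficulty scores from the trained scoring network.
    embeddings: (n, d) instruction embeddings from any encoder.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adjusted = scores.astype(float).copy()
    chosen = []
    for _ in range(k):
        i = int(np.argmax(adjusted))
        chosen.append(i)
        adjusted[i] = -np.inf                      # never re-select
        sim = emb @ emb[i]                         # cosine similarity to pick
        adjusted -= penalty * np.maximum(sim, 0)   # penalize near-duplicates
    return chosen
```

With this scheme a slightly easier but dissimilar sample can outrank a near-copy of an already-chosen hard one, which is the diversity effect the abstract describes.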
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Africa > Rwanda > Kigali > Kigali (0.04)
- (8 more...)
Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models
Lu, Jiaying, Rao, Jinmeng, Chen, Kezhen, Guo, Xiaoyuan, Zhang, Yawen, Sun, Baochen, Yang, Carl, Yang, Jie
Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly regarding safety, robustness, and reliability, is their constrained semantic grounding ability, which pertains to connecting language to the physical-world entities or concepts referenced in images. Therefore, a crucial need arises for a comprehensive study to assess the semantic grounding ability of widely used LVLMs. Despite the significance, sufficient investigation in this direction is currently lacking. Our work bridges this gap by designing a pipeline for generating large-scale evaluation datasets covering fine-grained semantic information, such as color, number, material, etc., along with a thorough assessment of seven popular LVLMs' semantic grounding ability. Results highlight prevalent misgrounding across various aspects and degrees. To address this issue, we propose a data-centric enhancement method that aims to improve LVLMs' semantic grounding ability through multimodal instruction tuning on fine-grained conversations. Experiments on enhanced LVLMs demonstrate notable improvements in addressing misgrounding issues.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Park, Jae Sung, Hessel, Jack, Chandu, Khyathi Raghavi, Liang, Paul Pu, Lu, Ximing, West, Peter, Yu, Youngjae, Huang, Qiuyuan, Gao, Jianfeng, Farhadi, Ali, Choi, Yejin
Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also, for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.
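The sampling-and-filtering pipeline (assemble a prompt from the global and per-region descriptions, then keep only generations the critic rates highly) can be sketched as follows. The prompt wording, function names, and threshold are hypothetical, not the paper's actual templates:

```python
def build_prompt(global_desc, region_descs):
    """Assemble an LLM prompt from a global image caption and
    per-region captions, in the spirit of the pipeline above."""
    regions = "\n".join(
        f"Region {i}: {desc}" for i, desc in enumerate(region_descs)
    )
    return (
        "Image: " + global_desc + "\n" + regions +
        "\nGenerate commonsense inferences grounded in the regions above."
    )

def filter_with_critic(samples, critic_score, threshold=0.8):
    """Keep only generated examples the critic rates above `threshold`."""
    return [s for s in samples if critic_score(s) >= threshold]
```

The surviving examples form the localized commonsense corpus used for distillation; the critic is what keeps LLM hallucinations about unseen regions out of the training set.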
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (5 more...)
- Leisure & Entertainment (0.67)
- Health & Medicine (0.46)
CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task, as there exist specific manifolds of the interactions that can be considered human-like and natural, but the human pose and the geometry of objects can vary even for similar interactions. Such diversity makes the task of annotating 3D interactions difficult and hard to scale, which limits the potential to reason about them in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is to show multiple 2D images captured from different viewpoints of humans interacting with the same type of object. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input as an "unbounded" data generator with effective controllability and view diversity. Despite lower image quality than real images, we demonstrate that the synthesized images are sufficient to learn the 3D human-object spatial relations. We present multiple strategies to leverage the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object types; and (4) a novel metric to assess the quality of 3D spatial learning of interaction.
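The multi-view occupancy reasoning in strategy (2) can be illustrated with a toy voting scheme: points back-projected from many synthesized views (already canonicalized to a shared object frame) accumulate votes in a voxel grid, and only cells that enough views agree on survive, which suppresses inconsistent 2D cues. Everything here, from the grid resolution to the vote threshold, is an illustrative assumption:

```python
import numpy as np

def occupancy_from_views(points_per_view, grid_size=32, bound=1.0, min_votes=2):
    """Accumulate canonicalized 3D points from many views into a voxel
    grid and keep the cells that enough views agree on; a toy stand-in
    for self-supervised occupancy reasoning.

    points_per_view: list of (n_i, 3) arrays in [-bound, bound]^3,
    one array per synthesized view.
    """
    votes = np.zeros((grid_size,) * 3, dtype=int)
    for pts in points_per_view:
        # Map coordinates in [-bound, bound] to voxel indices.
        idx = np.floor((pts + bound) / (2 * bound) * grid_size).astype(int)
        idx = np.clip(idx, 0, grid_size - 1)
        # Each view votes at most once per voxel.
        for v in {tuple(row) for row in idx}:
            votes[v] += 1
    return votes >= min_votes
```

Regions that only a single inconsistent view claims are occupied never reach the vote threshold, so the consensus occupancy reflects what the views agree on.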
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment > Sports (0.93)