A Derivations of Variational Inference and ELBO; A.1 Derivation of the optimal q(·)

Neural Information Processing Systems

We expand Eq. 10 as follows; there are three KL divergence terms in our training objective ELBO. For the Yelp Medium and Yelp Large datasets, we follow Guu et al. (2018) in using a three-layer attentional LSTM, with skip connections between adjacent LSTM layers. We apply annealing and free-bits techniques following Li et al. (2019) to the KL term on the prototype variable. As in Section 4.3, here we show more generated examples through interpolation on the MSCOCO dataset. Table 6: Qualitative examples from the MSCOCO dataset on interpolated sentence generation given the prototype.
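
As a hedged illustration (not the paper's actual Eq. 10, which contains three KL terms), an ELBO with a prototype latent variable z typically combines a reconstruction term with a KL regularizer, and annealing and free bits enter as a weight beta and a floor lambda on that KL:

\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \beta \, \max\big(\lambda,\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\big)

Here \beta is annealed from 0 to 1 over training and \lambda is the free-bits threshold below which the KL term is not penalized; the paper's full objective splits this regularization across its three KL terms, applying annealing and free bits to the prototype KL.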


Diverse Image Captioning with Context Object Split Latent Spaces

Neural Information Processing Systems

The word embedding dimension is 300. In Tab. 7 we further evaluate the diversity of COS-CVAE using self-CIDEr. We provide additional qualitative results in the following tables. In Tab. 12 we show the diverse captions for novel objects generated by our model, together with the corresponding regions. The evaluation server for nocaps accepts only one caption per image and does not support methods modeling one-to-many relationships between images and captions. In Figure 1 (left) we show the average accuracy and diversity scores, again averaged across annotators; in Figure 1 (right) we show the accuracy and diversity scores from each annotator. We find that the captions generated by COS-CVAE are scored as more accurate than those of COS-CVAE (paired).
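
As a rough sketch of how caption diversity can be quantified, the snippet below computes an average pairwise n-gram dissimilarity over captions sampled for one image. This is a simplified stand-in for illustration only, not the self-CIDEr metric reported in Tab. 7.

from itertools import combinations

def ngrams(tokens, n=2):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pairwise_diversity(captions, n=2):
    """Average pairwise dissimilarity (1 - n-gram Jaccard) over sampled captions.

    Higher values mean the sampled captions differ more from each other;
    this is only a crude proxy for diversity metrics such as self-CIDEr.
    """
    gram_sets = [set(ngrams(c.lower().split(), n)) for c in captions]
    scores = []
    for a, b in combinations(gram_sets, 2):
        union = a | b
        jaccard = len(a & b) / len(union) if union else 1.0
        scores.append(1.0 - jaccard)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical captions sampled from a one-to-many captioning model.
samples = [
    "a man riding a wave on a surfboard",
    "a surfer rides a large wave in the ocean",
    "a person on a surfboard in the water",
]
print(round(pairwise_diversity(samples), 3))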


Robotic underwater glider sets out to circumnavigate the globe

New Scientist

Redwing, a robotic submarine about the size of a surfboard, is embarking on a five-year journey that will follow the famed explorer Ferdinand Magellan's voyage around the world. A small robot submarine is setting out to go around the world for the first time. Teledyne Marine and Rutgers University New Brunswick in New Jersey are launching an underwater glider called Redwing on its Sentinel Mission from Martha's Vineyard in Massachusetts on 11 October. Researchers have been using underwater gliders since the 1990s. Rather than a propeller, gliders have a buoyancy engine: a gas-filled piston that slightly changes the craft's overall buoyancy. An electric motor pushes the piston in to make the glider heavier than water so it slowly sinks, coasting downwards at a shallow angle.
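
As a back-of-the-envelope sketch of why a small piston movement is enough (the mass and volumes below are made-up numbers, not Redwing's specifications): the glider sinks or rises depending on whether its weight exceeds the weight of the water it displaces.

RHO_SEAWATER = 1025.0  # kg/m^3, typical seawater density
G = 9.81               # m/s^2

def net_vertical_force(mass_kg, displaced_volume_m3):
    """Net upward force on the glider: buoyancy minus weight, in newtons."""
    buoyancy = RHO_SEAWATER * displaced_volume_m3 * G
    weight = mass_kg * G
    return buoyancy - weight

# Assumed 60 kg glider; the piston shifts displaced volume by only about a litre.
mass = 60.0
retracted = 0.0585   # m^3 -> slightly negative buoyancy, glider sinks
extended = 0.0595    # m^3 -> slightly positive buoyancy, glider rises

print(net_vertical_force(mass, retracted))  # negative: descends at a shallow angle
print(net_vertical_force(mass, extended))   # positive: ascends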


BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models

Huang, Wanaiu

arXiv.org Artificial Intelligence

Semantic information is vital for human interaction, and decoding it from brain activity enables non-invasive clinical augmentative and alternative communication. While there has been significant progress in reconstructing visual images, few studies have focused on the language aspect. To address this gap, leveraging the powerful capabilities of the decoder-based vision-language pretrained model CoCa, this paper proposes BrainChat, a simple yet effective generative framework aimed at rapidly accomplishing semantic information decoding tasks from brain activity, including fMRI question answering and fMRI captioning. BrainChat employs the self-supervised approach of Masked Brain Modeling to encode sparse fMRI data, obtaining a more compact embedding representation in the latent space. Subsequently, BrainChat bridges the gap between modalities by applying contrastive loss, resulting in aligned representations of fMRI, image, and text embeddings. Furthermore, the fMRI embeddings are mapped to the generative Brain Decoder via cross-attention layers, where they guide the generation of textual content about fMRI in an autoregressive manner by minimizing caption loss. Empirically, BrainChat exceeds the performance of existing state-of-the-art methods in the fMRI captioning task and, for the first time, implements fMRI question answering. Additionally, BrainChat is highly flexible and can achieve high performance without image data, making it better suited for real-world scenarios with limited data.
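
A minimal PyTorch sketch of the two training signals described above: contrastive alignment between fMRI and caption embeddings plus a cross-attention captioning loss. The module names, sizes, and pooling choices are illustrative assumptions, not BrainChat's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FMRICaptioner(nn.Module):
    """Toy fMRI-to-text model: encode fMRI, align with text, decode captions."""

    def __init__(self, fmri_dim=4096, embed_dim=256, vocab_size=1000):
        super().__init__()
        self.fmri_encoder = nn.Sequential(nn.Linear(fmri_dim, embed_dim), nn.GELU())
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        # Cross-attention: text tokens attend to the fMRI embedding.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, fmri, tokens):
        brain = self.fmri_encoder(fmri)                      # (B, D)
        # Contrastive loss: match each fMRI embedding to its own caption embedding.
        text = self.text_embed(tokens).mean(dim=1)           # (B, D) pooled caption
        logits = F.normalize(brain, dim=-1) @ F.normalize(text, dim=-1).T
        targets = torch.arange(fmri.size(0))
        contrastive = F.cross_entropy(logits / 0.07, targets)
        # Caption loss: predict the next token, conditioning on fMRI via cross-attention.
        tok = self.text_embed(tokens[:, :-1])                # (B, T-1, D)
        fused, _ = self.cross_attn(tok, brain.unsqueeze(1), brain.unsqueeze(1))
        caption = F.cross_entropy(self.lm_head(fused).transpose(1, 2), tokens[:, 1:])
        return contrastive + caption

model = FMRICaptioner()
loss = model(torch.randn(8, 4096), torch.randint(0, 1000, (8, 12)))
loss.backward()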


Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection

Chen, Ruibo, Wu, Yihan, Chen, Lichang, Liu, Guodong, He, Qi, Xiong, Tianyi, Liu, Chenxi, Guo, Junfeng, Huang, Heng

arXiv.org Artificial Intelligence

Data selection in instruction tuning emerges as a pivotal process for acquiring high-quality data and training instruction-following large language models (LLMs), but it is still a new and unexplored research area for vision-language models (VLMs). Existing data selection approaches on LLMs either rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential over-fitting on the chosen evaluation datasets. To address this challenge, we introduce a novel dataset selection method, Self-Filter, that utilizes the VLM itself as a filter. This approach is inspired by the observation that VLMs benefit from training with the most challenging instructions. Self-Filter operates in two stages. In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM. In the second stage, we use the trained scoring network to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity. Comprehensive experiments on LLaVA and MiniGPT-4 show that Self-Filter can reach better results compared to full data settings with merely about 15% samples, and can achieve superior performance against competitive baselines.
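
A hedged sketch of the second stage as summarized above: rank samples by a learned difficulty score, then greedily keep the hardest ones while penalizing similarity to what has already been selected. The difficulty scores and embeddings are assumed to be given; this is not Self-Filter's actual implementation.

import numpy as np

def select_hard_and_diverse(scores, embeddings, budget, penalty=0.5):
    """Greedily pick the hardest samples while penalizing similarity to prior picks.

    scores      : (N,) difficulty from a co-trained scoring network (assumed given)
    embeddings  : (N, D) instruction embeddings used to measure similarity
    budget      : number of samples to keep (e.g. ~15% of the dataset)
    penalty     : how strongly similarity to selected samples lowers a score
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adjusted = scores.astype(float).copy()
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(adjusted))
        selected.append(idx)
        adjusted[idx] = -np.inf                        # never pick the same sample twice
        sim = emb @ emb[idx]                           # cosine similarity to the new pick
        adjusted -= penalty * np.clip(sim, 0.0, None)  # down-weight similar samples
    return selected

# Toy usage with random difficulty scores and embeddings.
rng = np.random.default_rng(0)
keep = select_hard_and_diverse(rng.random(100), rng.normal(size=(100, 16)), budget=15)
print(keep)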


Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models

Lu, Jiaying, Rao, Jinmeng, Chen, Kezhen, Guo, Xiaoyuan, Zhang, Yawen, Sun, Baochen, Yang, Carl, Yang, Jie

arXiv.org Artificial Intelligence

Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly regarding safety, robustness, and reliability, is their constrained semantic grounding ability, which pertains to connecting language to the physical-world entities or concepts referenced in images. Therefore, a crucial need arises for a comprehensive study to assess the semantic grounding ability of widely used LVLMs. Despite the significance, sufficient investigation in this direction is currently lacking. Our work bridges this gap by designing a pipeline for generating large-scale evaluation datasets covering fine-grained semantic information, such as color, number, material, etc., along with a thorough assessment of seven popular LVLMs' semantic grounding ability. Results highlight prevalent misgrounding across various aspects and to varying degrees. To address this issue, we propose a data-centric enhancement method that aims to improve LVLMs' semantic grounding ability through multimodal instruction tuning on fine-grained conversations. Experiments on enhanced LVLMs demonstrate notable improvements in addressing misgrounding issues.
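
A minimal sketch of how fine-grained grounding probes of this kind might be generated from structured region annotations; the attribute fields and question templates below are assumptions for illustration, not the paper's actual pipeline.

# Hypothetical region annotations: object name plus fine-grained attributes.
regions = [
    {"object": "mug", "color": "red", "material": "ceramic", "count": 2},
    {"object": "chair", "color": "black", "material": "wood", "count": 4},
]

TEMPLATES = {
    "color": "What color is the {object}?",
    "material": "What is the {object} made of?",
    "count": "How many {object}s are in the image?",
}

def build_grounding_probes(regions):
    """Turn attribute annotations into (question, answer) grounding probes."""
    probes = []
    for r in regions:
        for attr, template in TEMPLATES.items():
            probes.append({
                "question": template.format(object=r["object"]),
                "answer": str(r[attr]),
                "aspect": attr,  # lets accuracy be broken down per attribute type
            })
    return probes

for p in build_grounding_probes(regions):
    print(p)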


Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Park, Jae Sung, Hessel, Jack, Chandu, Khyathi Raghavi, Liang, Paul Pu, Lu, Ximing, West, Peter, Yu, Youngjae, Huang, Qiuyuan, Gao, Jianfeng, Farhadi, Ali, Choi, Yejin

arXiv.org Artificial Intelligence

Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in VL models with more precise reasoning compared to a baseline of passing a generated referring expression to an LLM.
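
A hedged sketch of the distillation loop described above: prompt an LLM with a global image description plus local region descriptions, then keep only generations the critic scores highly. The query_llm and critic_score callables and the prompt format are placeholders, not the authors' actual prompts or models.

from typing import Callable

def build_localized_corpus(
    images: list,
    query_llm: Callable[[str], str],
    critic_score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> list:
    """Distill localized commonsense: prompt an LLM per image, filter with a critic.

    Each image dict is assumed to carry a global caption and region descriptions,
    e.g. {"caption": "...", "regions": {"region 1": "...", "region 2": "..."}}.
    """
    corpus = []
    for img in images:
        region_text = "\n".join(f"{rid}: {desc}" for rid, desc in img["regions"].items())
        prompt = (
            f"Image description: {img['caption']}\n"
            f"Regions:\n{region_text}\n"
            "State a commonsense inference about region 1 and explain why."
        )
        knowledge = query_llm(prompt)
        # Keep only high-quality generations, as judged by a separately trained critic.
        if critic_score(prompt, knowledge) >= threshold:
            corpus.append({"prompt": prompt, "knowledge": knowledge})
    return corpus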


CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

Han, Sookwan, Joo, Hanbyul

arXiv.org Artificial Intelligence

We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task, as there exist specific manifolds of the interactions that can be considered human-like and natural, but the human pose and the geometry of objects can vary even for similar interactions. Such diversity makes the annotating task of 3D interactions difficult and hard to scale, which limits the potential to reason about them in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is by showing multiple 2D images captured from different viewpoints when humans interact with the same type of objects. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input as an "unbounded" data generator with effective controllability and view diversity. Despite the imperfect quality of these images compared to real images, we demonstrate that the synthesized images are sufficient to learn the 3D human-object spatial relations. We present multiple strategies to leverage the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object types; and (4) a novel metric to assess the quality of 3D spatial learning of interaction.
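
A highly simplified sketch of the occupancy-reasoning idea: after pose canonicalization, per-view 2D object masks can vote into a shared 3D grid, so voxels that consistently fall inside the object across synthesized views accumulate high occupancy. The projection callables and voting rule are toy assumptions, not the paper's method.

import numpy as np

def aggregate_occupancy(masks, projections, grid_size=32):
    """Vote per-view 2D object masks into a canonical 3D occupancy grid.

    masks       : list of (H, W) boolean object masks, one per synthesized view
    projections : list of callables mapping (N, 3) canonical points -> (N, 2) pixels
    Returns the fraction of views in which each voxel projects inside the mask.
    """
    # Voxel centers of a unit cube in the canonical (pose-normalized) frame.
    lin = (np.arange(grid_size) + 0.5) / grid_size
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

    votes = np.zeros(len(points))
    for mask, project in zip(masks, projections):
        uv = np.round(project(points)).astype(int)          # (N, 2) pixel coordinates
        h, w = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(points), dtype=bool)
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]]     # row = y, column = x
        votes += hit
    return (votes / len(masks)).reshape(grid_size, grid_size, grid_size)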