AITopics | natural language supervision


natural language supervision


Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation

Neural Information Processing Systems, Jun-12-2026, 20:24:34 GMT

In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's (of an MLLM) hidden representations. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
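The coupled objective described in this abstract can be pictured as a weighted sum of a next-token loss and an embedding-prediction loss. The following is only a minimal sketch under stated assumptions: the function names, the `embed_weight` parameter, and the use of cosine distance as the embedding loss are hypothetical stand-ins, not details taken from the paper.

```python
import math

def cross_entropy(logits, target_index):
    """Next-token loss: negative log-softmax probability of the target token."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]

def cosine_distance(a, b):
    """1 - cosine similarity between a predicted and a target embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def coupled_loss(logits, target_index, predicted_embedding, teacher_embedding,
                 embed_weight=0.5):
    """Text loss plus a weighted embedding loss.

    embed_weight and the cosine-distance term are assumptions made for
    illustration; the paper may weight or define the terms differently.
    """
    text_loss = cross_entropy(logits, target_index)
    embed_loss = cosine_distance(predicted_embedding, teacher_embedding)
    return text_loss + embed_weight * embed_loss
```

In this sketch, `teacher_embedding` plays the role of a feature from an expert vision encoder, and `predicted_embedding` the role of a projection of the LLM's hidden state. When the two match, only the text loss remains.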

artificial intelligence, large language model, natural language, (8 more...)


Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)


A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Zhong, Yifan, Bai, Fengshuo, Cai, Shaofei, Huang, Xuchuan, Chen, Zhang, Zhang, Xiaowei, Wang, Yuanfei, Guo, Shaoyang, Guan, Tianrui, Lui, Ka Nam, Qi, Zhiquan, Liang, Yitao, Chen, Yuanpei, Yang, Yaodong

arXiv.org Artificial Intelligence, Jul-3-2025

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.

large language model, machine learning, natural language, (24 more...)

arXiv:2507.01925

Country:

North America (0.67)
Europe (0.67)
Asia > Japan (0.45)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment > Games > Computer Games (0.67)
Education (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
(8 more...)


Natural Language Supervision for Low-light Image Enhancement

Tang, Jiahui, Zhou, Kaihua, Luo, Zhijian, Hou, Yueen

arXiv.org Artificial Intelligence, Jan-11-2025

With the development of deep learning, numerous methods for low-light image enhancement (LLIE) have demonstrated remarkable performance. Mainstream LLIE methods typically learn an end-to-end mapping based on pairs of low-light and normal-light images. However, normal-light images under varying illumination conditions serve as reference images, making it difficult to define a "perfect" reference image. This leads to the challenge of reconciling metric-oriented and visual-friendly results. Recently, many cross-modal studies have found that side information from other related modalities can guide visual representation learning. Based on this, we introduce a Natural Language Supervision (NLS) strategy, which learns feature maps from text corresponding to images, offering a general and flexible interface for describing an image under different illumination. However, image distributions conditioned on textual descriptions are highly multimodal, which makes training difficult. To address this issue, we design a Textual Guidance Conditioning Mechanism (TCM) that incorporates the connections between image regions and sentence words, enhancing the ability to capture fine-grained cross-modal cues for images and text. This strategy not only utilizes a wider range of supervised sources, but also provides a new paradigm for LLIE based on visual and textual feature alignment. In order to effectively identify and merge features from various levels of image and textual information, we design an Information Fusion Attention (IFA) module to enhance different regions at different levels. We integrate the proposed TCM and IFA into a Natural Language Supervision network for LLIE, named NaLSuper. Finally, extensive experiments demonstrate the robustness and superior effectiveness of our proposed NaLSuper.

enhancement, machine learning, natural language, (12 more...)

arXiv:2501.06546

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)


CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

Kang, Gi-Cheon, Kim, Junghyun, Shim, Kyuhwan, Lee, Jun Ki, Zhang, Byoung-Tak

arXiv.org Artificial Intelligence, Nov-1-2024

This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., "move forward") and further augments these demonstrations. Next, we introduce a model that learns language-conditioned policies from natural language supervision called CLIP-RT. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).
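To make the idea of contrastive learning over actions expressed in language more concrete, here is a generic InfoNCE-style loss over a batch of (context, action-text) embedding pairs. This is a sketch under assumptions: the function names, the `temperature` value, and the loss form are illustrative and are not necessarily what CLIP-RT uses. The embeddings are assumed to come from some image/instruction encoder and text encoder that are not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(context_embeddings, action_embeddings, temperature=0.1):
    """Mean cross-entropy in which context i should match action text i.

    Every other action text in the batch acts as a negative example.
    temperature=0.1 is a hypothetical default chosen for illustration.
    """
    contexts = [normalize(v) for v in context_embeddings]
    actions = [normalize(v) for v in action_embeddings]
    total = 0.0
    for i, context in enumerate(contexts):
        logits = [dot(context, action) / temperature for action in actions]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(contexts)
```

The loss is small when each context embedding lies closest to the embedding of its own action phrase (for example "move forward") and large when it lies closer to a different phrase in the batch.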

large language model, machine learning, natural language, (14 more...)

arXiv:2411.00508

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)


BELT: Bootstrapping Electroencephalography-to-Language Decoding and Zero-Shot Sentiment Classification by Natural Language Supervision

Zhou, Jinzhao, Duan, Yiqun, Chang, Yu-Cheng, Wang, Yu-Kai, Lin, Chin-Teng

arXiv.org Artificial Intelligence, Dec-9-2023

This paper presents BELT, a novel model and learning framework for the pivotal topic of brain-to-language translation research. The translation from noninvasive brain signals into readable natural language has the potential to promote the application scenario as well as the development of brain-computer interfaces (BCI) as a whole. The critical problem in brain signal decoding or brain-to-language translation is the acquisition of semantically appropriate and discriminative EEG representation from a dataset of limited scale and quality. The proposed BELT method is a generic and efficient framework that bootstraps EEG representation learning using off-the-shelf large-scale pretrained language models (LMs). With a large LM's capacity for understanding semantic information and zero-shot generalization, BELT utilizes large LMs trained on Internet-scale datasets to bring significant improvements to the understanding of EEG signals. In particular, the BELT model is composed of a deep conformer encoder and a vector quantization encoder. Semantic EEG representation is achieved by a contrastive learning step that provides natural language supervision. We achieve state-of-the-art results on two featured brain decoding tasks, brain-to-language translation and zero-shot sentiment classification. Specifically, our model surpasses the baseline model on both tasks by 5.45% and over 10%, and achieves a 42.31% BLEU-1 score and 67.32% precision on the main evaluation metrics for translation and zero-shot sentiment classification, respectively.

decoding and zero-shot sentiment classification, natural language supervision

arXiv:2309.12056

Genre: Research Report (1.00)

Industry:

Health & Medicine > Health Care Technology (0.89)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.80)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.80)
(2 more...)


Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Weers, Floris, Shankar, Vaishaal, Katharopoulos, Angelos, Yang, Yinfei, Gunter, Tom

arXiv.org Artificial Intelligence, May-15-2023

Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but most notably their results use small pre-training datasets (<50M samples) and don't effectively reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv:2301.07836

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Transportation (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)


#AAAI2023 workshops round-up 1: AI for credible elections, and responsible human-centric AI

AIHub, Mar-1-2023, 15:17:27 GMT

The AAAI 2023 Workshop on Representation Learning for Responsible Human-Centric AI (R2HCAI) brought together researchers who are broadly interested in representation learning for responsible human-centric AI. The goal of the workshop was to facilitate the development and adoption of AI systems that can enhance, augment, and improve the quality of human life. We had six inspiring invited talks from renowned researchers that covered a wide range of research in the field of responsible human-centric AI. Marzyeh Ghassemi gave a talk on designing machine learning processes for equitable health systems, while Daniel Ruckert shared their recent work on human-centered AI for medical imaging. Kathy Meier-Hellstern shared a framework for responsible AI for large models, and Jacob Andreas presented their research towards natural language supervision.

kathy meier-hellstern, representation, responsible human-centric ai, (11 more...)


Industry: Health & Medicine > Health Care Providers & Services (0.60)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.89)


Three innovation areas in AI that everyone is fighting for

#artificialintelligence, Jul-28-2022, 13:05:08 GMT

"In my youth, I would've argued that life is just a series of random events, devoid of any meaning. But as a data scientist, I must recognise that patterns sometimes emerge." When Gilfoyle, one of the main characters on the popular sitcom Silicon Valley, said this, he could just as well have extended it to the patterns that emerge in the AI innovation space. It is an undeniable fact that whenever a new, popular and eye-grabbing tool comes to the market, tech companies rush to replicate it and create their own renditions. This gives birth to a certain trend – a pattern.

generation tool, language model, openai, (14 more...)


Country: North America > United States > California (0.25)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.63)


Using CLIP to Classify Images without any Labels

#artificialintelligence, Jul-6-2022, 17:30:28 GMT

Deep image classification models are typically trained in a supervised manner over a large, annotated dataset. Although a model's performance will improve as more annotated data becomes available, large-scale datasets for supervised learning are often difficult and expensive to obtain, requiring numerous hours of effort from expert annotators. With this in mind, one may begin to wonder if cheaper sources of supervision exist. Put simply, is it possible to learn high-quality image classification models from data that is already publicly available? The proposal of the Contrastive Language-Image Pre-Training (CLIP) model [1] by OpenAI, recently re-popularized due to its use in the DALLE-2 model, answered this question in a positive fashion.
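Classifying images without labels with a CLIP-style model generally amounts to comparing one image embedding against the embeddings of text prompts, one prompt per candidate class. The sketch below illustrates that comparison with toy vectors standing in for real encoder outputs. The function names, the prompt wording, the embedding values, and the `temperature` value are all assumptions made for illustration, and no real CLIP API is called.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probabilities(image_embedding, prompt_embeddings, temperature=0.01):
    """Softmax over temperature-scaled image/prompt cosine similarities."""
    labels = list(prompt_embeddings)
    logits = [cosine_similarity(image_embedding, prompt_embeddings[label]) / temperature
              for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return {label: weight / total for label, weight in zip(labels, weights)}

# Toy stand-ins for the outputs of a text encoder and an image encoder.
prompt_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_embedding = [0.8, 0.2, 0.1]

probabilities = zero_shot_probabilities(image_embedding, prompt_embeddings)
predicted_label = max(probabilities, key=probabilities.get)
print(predicted_label)
```

Because the class names enter only through the text prompts, changing the label set means changing the prompts rather than retraining a classifier head.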

classification, dataset, representation, (14 more...)


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.77)


The top AI news of 2021

#artificialintelligence, Dec-31-2021, 10:14:53 GMT

The year introduced some gigantic AI models and GPT-3 competitors while witnessing regulatory crackdowns on big tech from countries worldwide. Some companies provided substantial aid to pandemic-struck nations, and some went in other directions, like flying to space. The year only kept getting more interesting towards the end. We've got you a timeline of the year, highlighting the most important updates of 2021 you should know. AI21 Labs released a language model that it claims is 'the largest and most sophisticated language model ever released for general use by developers.'

metaverse, omniverse, openai, (15 more...)


Country:

Europe (0.05)
Asia (0.05)

Industry:

Health & Medicine (1.00)
Information Technology > Services (0.97)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
