natural language supervision
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Zhong, Yifan, Bai, Fengshuo, Cai, Shaofei, Huang, Xuchuan, Chen, Zhang, Zhang, Xiaowei, Wang, Yuanfei, Guo, Shaoyang, Guan, Tianrui, Lui, Ka Nam, Qi, Zhiquan, Liang, Yitao, Chen, Yuanpei, Yang, Yaodong
The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
Natural Language Supervision for Low-light Image Enhancement
Tang, Jiahui, Zhou, Kaihua, Luo, Zhijian, Hou, Yueen
With the development of deep learning, numerous methods for low-light image enhancement (LLIE) have demonstrated remarkable performance. Mainstream LLIE methods typically learn an end-to-end mapping based on pairs of low-light and normal-light images. However, normal-light images under varying illumination conditions serve as reference images, making it difficult to define a "perfect" reference image. This leads to the challenge of reconciling metric-oriented and visual-friendly results. Recently, many cross-modal studies have found that side information from other related modalities can guide visual representation learning. Based on this, we introduce a Natural Language Supervision (NLS) strategy, which learns feature maps from text corresponding to images, offering a general and flexible interface for describing an image under different illumination. However, image distributions conditioned on textual descriptions are highly multimodal, which makes training difficult. To address this issue, we design a Textual Guidance Conditioning Mechanism (TCM) that incorporates the connections between image regions and sentence words, enhancing the ability to capture fine-grained cross-modal cues for images and text. This strategy not only utilizes a wider range of supervised sources, but also provides a new paradigm for LLIE based on visual and textual feature alignment. In order to effectively identify and merge features from various levels of image and textual information, we design an Information Fusion Attention (IFA) module to enhance different regions at different levels. We integrate the proposed TCM and IFA into a Natural Language Supervision network for LLIE, named NaLSuper. Finally, extensive experiments demonstrate the robustness and superior effectiveness of our proposed NaLSuper.
CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision
Kang, Gi-Cheon, Kim, Junghyun, Shim, Kyuhwan, Lee, Jun Ki, Zhang, Byoung-Tak
This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., "move forward") and further augments these demonstrations. Next, we introduce a model that learns language-conditioned policies from natural language supervision called CLIP-RT. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).
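The abstract above mentions contrastive learning between a robot's context and actions written as language. As a rough illustration only, the sketch below computes a generic symmetric InfoNCE-style loss over paired embeddings, in the spirit of CLIP. The abstract does not specify the exact objective, so this should not be read as CLIP-RT's loss; the function names, the temperature value, and the toy unit-length embeddings are all assumptions made for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum_exp for x in row]


def contrastive_loss(context_embs, action_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (hypothetical helper, not from the paper).

    Row i of context_embs is assumed to be paired with row i of action_embs,
    and all embeddings are assumed to be unit length.
    """
    n = len(context_embs)
    # Pairwise similarity logits: sim[i][j] = <context_i, action_j> / temperature
    sim = [
        [sum(a * b for a, b in zip(c, t)) / temperature for t in action_embs]
        for c in context_embs
    ]
    # Context -> action direction: the matching action should score highest in each row.
    loss_c2a = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Action -> context direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    loss_a2c = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_c2a + loss_a2c)


# Toy data: two contexts and two language actions (e.g. "move forward", "move back").
contexts = [[1.0, 0.0], [0.0, 1.0]]
aligned_actions = [[1.0, 0.0], [0.0, 1.0]]
misaligned_actions = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(contexts, aligned_actions)
misaligned = contrastive_loss(contexts, misaligned_actions)
```

In this toy setup the loss is close to zero when each context embedding matches its paired action embedding and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.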
BELT: Bootstrapping Electroencephalography-to-Language Decoding and Zero-Shot Sentiment Classification by Natural Language Supervision
Zhou, Jinzhao, Duan, Yiqun, Chang, Yu-Cheng, Wang, Yu-Kai, Lin, Chin-Teng
This paper presents BELT, a novel model and learning framework for the pivotal topic of brain-to-language translation research. The translation from noninvasive brain signals into readable natural language has the potential to promote the application scenario as well as the development of brain-computer interfaces (BCI) as a whole. The critical problem in brain signal decoding or brain-to-language translation is the acquisition of semantically appropriate and discriminative EEG representation from a dataset of limited scale and quality. The proposed BELT method is a generic and efficient framework that bootstraps EEG representation learning using off-the-shelf large-scale pretrained language models (LMs). With a large LM's capacity for understanding semantic information and zero-shot generalization, BELT utilizes large LMs trained on Internet-scale datasets to bring significant improvements to the understanding of EEG signals. In particular, the BELT model is composed of a deep conformer encoder and a vector quantization encoder. Semantic EEG representation is achieved by a contrastive learning step that provides natural language supervision. We achieve state-of-the-art results on two featured brain decoding tasks: brain-to-language translation and zero-shot sentiment classification. Specifically, our model surpasses the baseline model on both tasks by 5.45% and over 10%, and achieves a 42.31% BLEU-1 score and 67.32% precision on the main evaluation metrics for translation and zero-shot sentiment classification, respectively.
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Weers, Floris, Shankar, Vaishaal, Katharopoulos, Angelos, Yang, Yinfei, Gunter, Tom
Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but most notably their results use small pre-training datasets (<50M samples) and don't effectively reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.
#AAAI2023 workshops round-up 1: AI for credible elections, and responsible human-centric AI
The AAAI 2023 Workshop on Representation Learning for Responsible Human-Centric AI (R2HCAI) brought together researchers who are broadly interested in representation learning for responsible human-centric AI. The goal of the workshop was to facilitate the development and adoption of AI systems that can enhance, augment, and improve the quality of human life. We had six inspiring invited talks from renowned researchers that covered a wide range of research in the field of responsible human-centric AI. Marzyeh Ghassemi gave a talk on designing machine learning processes for equitable health systems, while Daniel Ruckert shared their recent work on human-centered AI for medical imaging. Kathy Meier-Hellstern shared a framework for responsible AI for large models, and Jacob Andreas presented their research towards natural language supervision.
Three innovation areas in AI that everyone is fighting for
"In my youth, I would've argued that life is just a series of random events, devoid of any meaning. But as a data scientist, I must recognise that patterns sometimes emerge." When Gilfoyle, one of the main characters on the popular sitcom Silicon Valley, said this, he could just as well have extended it to the patterns that emerge in the AI innovation space. It is an undeniable fact that whenever a new, popular and eye-grabbing tool comes to the market, tech companies rush to replicate it and create their own renditions. This gives birth to a certain trend – a pattern.
Using CLIP to Classify Images without any Labels
Deep image classification models are typically trained in a supervised manner over a large, annotated dataset. Although a model's performance will improve as more annotated data becomes available, large-scale datasets for supervised learning are often difficult and expensive to obtain, requiring numerous hours of effort from expert annotators. With this in mind, one may begin to wonder if cheaper sources of supervision exist. Put simply, is it possible to learn high-quality image classification models from data that is already publicly available? The Contrastive Language-Image Pre-Training (CLIP) model [1] proposed by OpenAI, recently re-popularized by its use in the DALLE-2 model, answered this question in the affirmative.
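To make the idea of classifying without labels concrete, here is a minimal sketch of CLIP-style zero-shot classification: each class name is turned into a text prompt, and the image is assigned to the prompt whose embedding is most similar to the image embedding. The embeddings below are hand-written toy vectors standing in for the outputs of a real image and text encoder, and the helper names, prompts, and temperature value are assumptions for illustration rather than part of any library API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Pick the text prompt most similar to the image (hypothetical helper).

    prompt_embeddings maps a prompt string to its embedding vector.
    The temperature value is an assumption chosen for this example.
    """
    prompts = list(prompt_embeddings)
    logits = [
        cosine(image_embedding, prompt_embeddings[p]) / temperature
        for p in prompts
    ]
    probs = softmax(logits)
    best = max(range(len(prompts)), key=lambda i: probs[i])
    return prompts[best], dict(zip(prompts, probs))


# Toy stand-ins for encoder outputs; a real system would compute these
# with a pre-trained image encoder and text encoder.
TOY_PROMPTS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
toy_image = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(toy_image, TOY_PROMPTS)
```

No classifier is trained here: adding a new class only requires adding a new prompt, which is what makes natural language a flexible source of supervision.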
The top AI news of 2021
The year has introduced some gigantic AI models and GPT-3 competitors while witnessing regulatory crackdowns on big tech from countries worldwide. Some companies provided huge aid to pandemic-struck nations, and some went in other directions, like flying to space. The year only kept getting more interesting by the end. We've got you a timeline of the year, highlighting the most important updates of 2021 you should know. AI21 Labs released a language model that it claims is 'the largest and most sophisticated language model ever released for general use by developers.'
Technical Language Supervision for Intelligent Fault Diagnosis in Process Industry
Löwenmark, Karl, Taal, Cees, Schnabel, Stephan, Liwicki, Marcus, Sandin, Fredrik
In the process industry, condition monitoring systems with automated fault diagnosis methods assist human experts and thereby improve maintenance efficiency, process sustainability, and workplace safety. Improving the automated fault diagnosis methods using data and machine learning-based models is a central aspect of intelligent fault diagnosis (IFD). A major challenge in IFD is to develop realistic datasets with accurate labels needed to train and validate models, and to transfer models trained with labeled lab data to heterogeneous process industry environments. However, fault descriptions and work-orders written by domain experts are increasingly digitized in modern condition monitoring systems, for example in the context of rotating equipment monitoring. Thus, domain-specific knowledge about fault characteristics and severities exists as technical language annotations in industrial datasets. Furthermore, recent advances in natural language processing enable weakly supervised model optimization using natural language annotations, most notably in the form of natural language supervision (NLS). This creates a timely opportunity to develop technical language supervision (TLS) solutions for IFD systems grounded in industrial data, for example as a complement to pre-training with lab data to address problems like overfitting and inaccurate out-of-sample generalisation. We surveyed the literature and identify a considerable improvement in the maturity of NLS over the last two years, facilitating applications beyond natural language; a rapid development of weak supervision methods; and transfer learning as a current trend in IFD which can benefit from these developments. Finally, we describe a framework for integration of TLS in IFD which is inspired by recent NLS innovations.