AITopics | Yang, Zhuolin

Collaborating Authors

Yang, Zhuolin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA, null, :, null, Azzolini, Alisson, Brandon, Hannah, Chattopadhyay, Prithvijit, Chen, Huayu, Chu, Jinju, Cui, Yin, Diamond, Jenna, Ding, Yifan, Ferroni, Francesco, Govindaraju, Rama, Gu, Jinwei, Gururani, Siddharth, Hanafi, Imad El, Hao, Zekun, Huffman, Jacob, Jin, Jingyi, Johnson, Brendan, Khan, Rizwan, Kurian, George, Lantz, Elena, Lee, Nayeon, Li, Zhaoshuo, Li, Xuan, Lin, Tsung-Yi, Lin, Yen-Chen, Liu, Ming-Yu, Mathau, Andrew, Ni, Yun, Pavao, Lindsey, Ping, Wei, Romero, David W., Smelyanskiy, Misha, Song, Shuran, Tchapmi, Lyne, Wang, Andrew Z., Wang, Boxin, Wang, Haoxiang, Wei, Fangyin, Xu, Jiashu, Xu, Yao, Yang, Xiaodong, Yang, Zhuolin, Zeng, Xiaohui, Zhang, Zhe

arXiv.org Artificial IntelligenceMar-18-2025

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements.

cosmo-reason1, reasoning, video, (15 more...)

arXiv.org Artificial Intelligence

2503.15558

Industry:

Education (0.68)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)

Add feedback

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Yang, Zhuolin, Ping, Wei, Liu, Zihan, Korthikanti, Vijay, Nie, Weili, Huang, De-An, Fan, Linxi, Yu, Zhiding, Lan, Shiyi, Li, Bo, Liu, Ming-Yu, Zhu, Yuke, Shoeybi, Mohammad, Catanzaro, Bryan, Xiao, Chaowei, Anandkumar, Anima

arXiv.org Artificial IntelligenceOct-22-2023

Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained the state-of-the-art results in image-to-text generation. However, these models store all the knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating new data, requiring a computational-expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon the Flamingo, that supports retrieving the relevant knowledge from the external database for zero and in-context few-shot image-to-text generations. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating the database. We also construct an interleaved image and text data that facilitates in-context few-shot learning capabilities. We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks, especially for zero-shot and few-shot generation in out-of-domain settings with 4 times less parameters compared with baseline methods.

caption, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2302.04858

Country: North America > United States > Wisconsin (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Add feedback

Interpolation for Robust Learning: Data Augmentation on Wasserstein Geodesics

Zhu, Jiacheng, Qiu, Jielin, Guha, Aritra, Yang, Zhuolin, Nguyen, Xuanlong, Li, Bo, Zhao, Ding

arXiv.org Artificial IntelligenceAug-28-2023

We propose to study and promote the robustness of a model as per its performance through the interpolation of training data distributions. Specifically, (1) we augment the data by finding the worst-case Wasserstein barycenter on the geodesic connecting subpopulation distributions of different categories. (2) We regularize the model for smoother performance on the continuous geodesic path connecting subpopulation distributions. (3) Additionally, we provide a theoretical guarantee of robustness improvement and investigate how the geodesic location and the sample size contribute, respectively. Experimental validations of the proposed strategy on \textit{four} datasets, including CIFAR-100 and ImageNet, establish the efficacy of our method, e.g., our method improves the baselines' certifiable robustness on CIFAR10 up to $7.7\%$, with $16.8\%$ on empirical robustness on CIFAR-100. Our work provides a new perspective of model robustness through the lens of Wasserstein geodesic-based interpolation with a practical off-the-shelf strategy that can be combined with existing robust training methods.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2302.02092

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (0.67)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

On the Certified Robustness for Ensemble Models and Beyond

Yang, Zhuolin, Li, Linyi, Xu, Xiaojun, Kailkhura, Bhavya, Xie, Tao, Li, Bo

arXiv.org Artificial IntelligenceJul-22-2021

Recent studies show that deep neural networks (DNN) are vulnerable to adversarial examples, which aim to mislead DNNs by adding perturbations with small magnitude. To defend against such attacks, both empirical and theoretical defense approaches have been extensively studied for a single ML model. In this work, we aim to analyze and provide the certified robustness for ensemble ML models, together with the sufficient and necessary conditions of robustness for different ensemble protocols. Although ensemble models are shown more robust than a single model empirically; surprisingly, we find that in terms of the certified robustness the standard ensemble models only achieve marginal improvement compared to a single model. Thus, to explore the conditions that guarantee to provide certifiably robust ensemble ML models, we first prove that diversified gradient and large confidence margin are sufficient and necessary conditions for certifiably robust ensemble models under the model-smoothness assumption. We then provide the bounded model-smoothness analysis based on the proposed Ensemble-before-Smoothing strategy. We also prove that an ensemble model can always achieve higher certified robustness than a single base model under mild conditions. Inspired by the theoretical findings, we propose the lightweight Diversity Regularized Training (DRT) to train certifiably robust ensemble ML models. Extensive experiments show that our DRT enhanced ensembles can consistently achieve higher certified robustness than existing single and ensemble ML models, demonstrating the state-of-the-art certified L2-robustness on MNIST, CIFAR-10, and ImageNet datasets.

base model, deep learning, neural network, (19 more...)

arXiv.org Artificial Intelligence

2107.10873

Country:

North America > United States (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.47)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Understanding Robustness in Teacher-Student Setting: A New Perspective

Yang, Zhuolin, Chen, Zhaoxi, Cai, Tiffany, Chen, Xinyun, Li, Bo, Tian, Yuandong

arXiv.org Artificial IntelligenceFeb-28-2021

Adversarial examples have appeared as a ubiquitous property of machine learning models where bounded adversarial perturbation could mislead the models to make arbitrarily incorrect predictions. Such examples provide a way to assess the robustness of machine learning models as well as a proxy for understanding the model training process. Extensive studies try to explain the existence of adversarial examples and provide ways to improve model robustness (e.g. adversarial training). While they mostly focus on models trained on datasets with predefined labels, we leverage the teacher-student framework and assume a teacher model, or oracle, to provide the labels for given instances. We extend Tian (2019) in the case of low-rank input data and show that student specialization (trained student neuron is highly correlated with certain teacher neuron at the same layer) still happens within the input subspace, but the teacher and student nodes could differ wildly out of the data subspace, which we conjecture leads to adversarial examples. Extensive experiments show that student specialization correlates strongly with model robustness in different scenarios, including student trained via standard training, adversarial training, confidence-calibrated adversarial training, and training with robust feature dataset. Our studies could shed light on the future exploration about adversarial examples, and enhancing model robustness via principled data augmentation.

artificial intelligence, neural network, specialization, (17 more...)

arXiv.org Artificial Intelligence

2102.1317

Country: North America > United States > California (0.14)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Scalable Differentially Private Generative Student Model via PATE

Long, Yunhui, Lin, Suxin, Yang, Zhuolin, Gunter, Carl A., Li, Bo

arXiv.org Machine LearningJun-21-2019

Recent rapid development of machine learning is largely due to algorithmic breakthroughs, computation resource development, and especially the access to a large amount of training data. However, though data sharing has the great potential of improving machine learning models and enabling new applications, there have been increasing concerns about the privacy implications of data collection. In this work, we present a novel approach for training differentially private data generator G-PATE. The generator can be used to produce synthetic datasets with strong privacy guarantee while preserving high data utility. Our approach leverages generative adversarial nets (GAN) to generate data and protect data privacy based on the Private Aggregation of Teacher Ensembles (PATE) framework. Our approach improves the use of privacy budget by only ensuring differential privacy for the generator, which is the part of the model that actually needs to be published for private data generation. To achieve this, we connect a student generator with an ensemble of teacher discriminators. We also propose a private gradient aggregation mechanism to ensure differential privacy on all the information that flows from the teacher discriminators to the student generator. We empirically show that the G-PATE significantly outperforms prior work on both image and non-image datasets.

educational technology, generator, neural network, (19 more...)

arXiv.org Machine Learning

1906.09338

Genre:

Research Report (1.00)
Overview (0.86)

Industry:

Information Technology > Security & Privacy (1.00)
Education > Educational Technology > Educational Software (0.41)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Characterizing Audio Adversarial Examples Using Temporal Dependency

Yang, Zhuolin, Li, Bo, Chen, Pin-Yu, Song, Dawn

arXiv.org Artificial IntelligenceSep-28-2018

Recent studies have highlighted adversarial examples as a ubiquitous threat to different neural network models and many downstream applications. Nonetheless, as unique data properties have inspired distinct and powerful learning principles, this paper aims to explore their potentials towards mitigating adversarial inputs. In particular, our results reveal the importance of using the temporal dependency in audio data to gain discriminate power against adversarial examples. Tested on the automatic speech recognition (ASR) tasks and three recent audio adversarial attacks, we find that (i) input transformation developed from image adversarial defense provides limited robustness improvement and is subtle to advanced attacks; (ii) temporal dependency can be exploited to gain discriminative power against audio adversarial examples and is resistant to adaptive attacks considered in our experiments. Our results not only show promising means of improving the robustness of ASR systems, but also offer novel insights in exploiting domain-specific data properties to mitigate negative effects of adversarial examples. Deep Neural Networks (DNNs) have been widely adopted in a variety of machine learning applications (Krizhevsky et al., 2012; Hinton et al., 2012; Levine et al., 2016). However, recent work has demonstrated that DNNs are vulnerable to adversarial perturbations (Szegedy et al., 2014; Goodfel-low et al., 2015). An adversary can add negligible perturbations to inputs and generate adversarial examples to mislead DNNs, first found in image-based machine learning tasks (Goodfellow et al., 2015; Carlini & Wagner, 2017a; Liu et al., 2017; Chen et al., 2017b;a; Su et al., 2018). Beyond images, given the wide application of DNN-based audio recognition systems, such as Google Home and Amazon Alexa, audio adversarial examples have also been studied recently (Car-lini & Wagner, 2018; Alzantot et al., 2018; Cisse et al., 2017; Kreuk et al., 2018). Comparing between image and audio learning tasks, although their state-of-the-art DNN architectures are quite different (i.e., convolutional v.s.

adversarial example, deep learning, neural network, (16 more...)

arXiv.org Artificial Intelligence

1809.10875

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Add feedback