vcr
Reviews: Heterogeneous Graph Learning for Visual Commonsense Reasoning
Originality: The VCR task is novel (proposed by Zellers et al., CVPR19), and the proposed HGL framework for this interesting task is itself novel. The paper applies the HGL framework on top of the baseline model (R2C from Zellers et al., CVPR19) and shows significant improvements. The paper also compares against other existing graph learning approaches; the main difference between the proposed approach and these is the heterogeneous nature (spanning the vision and language domains) of the graph learning framework. Quality: The paper does a good job of evaluating the proposed approach and its ablations.
Reviews: Connective Cognition Network for Directional Visual Commonsense Reasoning
Originality: The paper proposes a novel model for the recently introduced VCR task. The main novelty of the proposed model lies in the GraphVLAD component and the directional GCN module. The paper notes that one of the closest works is that of Narasimhan et al., NeurIPS 2018, which used a GCN to infer answers in VQA; however, that work constructs an undirected graph, ignoring directional information between the graph nodes. This paper instead uses a directed graph and shows the usefulness of incorporating directional information. It would be good for this paper to include more related work on the GraphVLAD front. Quality: The paper evaluates the proposed approach on the VCR dataset and compares with the baselines and previous state of the art, demonstrating how the proposed work improves the previous best performance significantly.
Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn
Deep neural networks provide Reinforcement Learning (RL) with powerful function approximators to address large-scale decision-making problems. However, these approximators introduce challenges due to the non-stationary nature of RL training. One source of these challenges is that output predictions can churn: they change uncontrollably after each batch update for states not included in the batch. Although this churn phenomenon exists at each step of network training, how churn occurs and impacts RL remains under-explored. In this work, we start by characterizing churn through the lens of Generalized Policy Iteration with function approximation, and we discover a chain effect of churn that leads to a cycle in which churn in value estimation and churn in policy improvement compound and bias the learning dynamics throughout the iteration. We then concretize the study, focusing on the learning issues caused by the chain effect in different settings: greedy action deviation in value-based methods, trust region violation in proximal policy optimization, and dual bias of policy value in actor-critic methods. We propose a method to reduce the chain effect across these settings, called Churn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms. Our experiments demonstrate the effectiveness of our method in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings, as well as a scaling setting.
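The abstract describes the churn-reduction idea only at a high level. A minimal numpy sketch of one plausible reading follows: after an ordinary regression step on the training batch, an extra step pulls the updated predictions on a held-out reference batch back toward their pre-update values. The linear Q-function, the coefficient `lam`, and the update form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def q_values(W, states):
    # Linear stand-in for a Q-network: one row of W per action.
    return states @ W.T

def chain_update(W, batch_s, batch_target, ref_s, lam=1.0, lr=0.1):
    """One CHAIN-style update (illustrative sketch, not the authors' code).

    After a regression step on the training batch, a churn-reduction step
    pulls the post-update Q-values on a held-out reference batch `ref_s`
    back toward their pre-update values, damping uncontrolled changes for
    states outside the batch.
    """
    q_ref_old = q_values(W, ref_s)                     # frozen pre-update predictions
    err = q_values(W, batch_s) - batch_target          # regression error on the batch
    W_new = W - lr * err.T @ batch_s / len(batch_s)    # ordinary gradient step
    churn = q_values(W_new, ref_s) - q_ref_old         # churn on reference states
    W_new = W_new - lr * lam * churn.T @ ref_s / len(ref_s)  # churn-reduction step
    return W_new
```

Setting `lam=0` recovers the plain update; a positive `lam` measurably shrinks the change of reference-state predictions, mirroring the regularization target the abstract describes.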
Microsoft asks to dismiss New York Times's 'doomsday' copyright lawsuit
The tech giant said the lawsuit was near-sighted and akin to Hollywood's losing backlash against the VCR. In a motion to dismiss part of the lawsuit filed Monday, Microsoft, which was sued in December alongside ChatGPT-maker OpenAI, scoffed at the newspaper's claim that Times content receives "particular emphasis" and that tech companies "seek to free-ride on the Times's massive investment in its journalism". But in its response, Microsoft said the lawsuit was akin to Hollywood's resistance to the VCR, which consumers used to record TV shows and which the entertainment business in the late 1970s feared would destroy its economic model. "'The VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone,'" Microsoft said in its response, quoting from congressional testimony delivered by Jack Valenti, then head of the Motion Picture Association of America, in 1982. In this case, Microsoft said, the Times is attempting to use "its might and its megaphone to challenge the latest profound technological advance: the Large Language Model."
- Media (1.00)
- Leisure & Entertainment (1.00)
- Law > Litigation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.50)
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Sun, Rui, Wang, Zhecan, You, Haoxuan, Codella, Noel, Chang, Kai-Wei, Chang, Shih-Fu
Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require the model to reason about the semantics of both the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works have exploited its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, mainly considering global-level matching (e.g., the whole image or sentence). However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine
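The global-level matching recipe the abstract builds on can be sketched in a few lines: each answer candidate is scored by cosine similarity between its text embedding and the image embedding, and the best-scoring candidate wins. The embeddings below are hypothetical inputs standing in for CLIP encoder outputs; this is a sketch of the generic recipe, not UniFine's fine-grained method.

```python
import numpy as np

def zero_shot_select(image_emb, candidate_embs):
    """Global-level zero-shot matching (CLIP-style recipe, illustrative only).

    `image_emb` (d,) and `candidate_embs` (k, d) stand in for CLIP image/text
    embeddings; the candidate (e.g. a VQA answer option) with the highest
    cosine similarity to the image is selected.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = txt @ img                      # cosine similarities, shape (k,)
    return int(np.argmax(scores)), scores
```

The fine-grained variant the abstract motivates would add similarity scores between object crops and sentence keywords and fuse them with this global score.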
Variance-Covariance Regularization Improves Representation Learning
Zhu, Jiachen, Shwartz-Ziv, Ravid, Chen, Yubei, LeCun, Yann
Transfer learning has emerged as a key approach in the machine learning domain, enabling the application of knowledge derived from one domain to improve performance on subsequent tasks. Given the often limited information about these subsequent tasks, a strong transfer learning approach calls for the model to capture a diverse range of features during the initial pretraining stage. However, recent research suggests that, without sufficient regularization, the network tends to concentrate on features that primarily reduce the pretraining loss function. This tendency can result in inadequate feature learning and impaired generalization capability for target tasks. To address this issue, we propose Variance-Covariance Regularization (VCR), a regularization technique aimed at fostering diversity in the learned network features. Drawing inspiration from recent advancements in self-supervised learning, our method promotes learned representations that exhibit high variance and minimal covariance, thus preventing the network from focusing solely on loss-reducing features. We empirically validate the efficacy of our method through comprehensive experiments coupled with in-depth analytical studies on the learned representations. In addition, we develop an efficient implementation strategy that assures minimal computational overhead associated with our method. Our results indicate that VCR is a powerful and efficient method for enhancing transfer learning performance for both supervised learning and self-supervised learning, opening new possibilities for future research in this domain.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York (0.04)
- Telecommunications > Networks (0.34)
- Information Technology > Networks (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
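The "high variance, minimal covariance" objective described in the abstract above can be sketched with a VICReg-style penalty: hinge each feature's standard deviation toward a floor, and push off-diagonal covariances toward zero. The exact form, `gamma`, and `eps` below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def variance_covariance_loss(z, gamma=1.0, eps=1e-4):
    """Variance-covariance penalty (sketch of the idea behind VCR).

    z: (batch, dim) representations. The variance term hinges each feature's
    std toward at least `gamma`, preventing collapse; the covariance term
    pushes off-diagonal covariances toward zero, discouraging redundant
    (correlated) features.
    """
    n, d = z.shape
    z_centered = z - z.mean(axis=0)
    std = np.sqrt(z_centered.var(axis=0) + eps)
    var_term = np.mean(np.maximum(0.0, gamma - std))   # hinge on per-dim std
    cov = (z_centered.T @ z_centered) / (n - 1)        # (dim, dim) covariance
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = (off_diag ** 2).sum() / d
    return var_term + cov_term
```

Collapsed representations (all samples identical) pay the full hinge penalty, while diverse, decorrelated features drive the loss toward zero, which is the behavior the regularizer is meant to encourage.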
Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering
Whitehouse, Chenxi, Weyde, Tillman, Madhyastha, Pranava
The field of visual question answering (VQA) has recently seen a surge in research focused on providing explanations for predicted answers. However, current systems mostly rely on separate models to predict answers and generate explanations, leading to less grounded and frequently inconsistent results. To address this, we propose a multitask learning approach towards a Unified Model for Answer and Explanation generation (UMAE). Our approach involves adding artificial prompt tokens to the training data and fine-tuning a multimodal encoder-decoder model on a variety of VQA-related tasks. In our experiments, UMAE models surpass the prior state-of-the-art answer accuracy on A-OKVQA by 10-15%, show competitive results on OK-VQA, achieve new state-of-the-art explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Leisure & Entertainment (0.46)
- Transportation (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.71)
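The UMAE abstract's "artificial prompt tokens" mechanism amounts to prepending a task marker so one shared encoder-decoder learns which output to generate. A minimal sketch follows; the marker strings are hypothetical and the actual tokens used by the paper may differ.

```python
def with_task_prompt(text, task):
    """Prepend an artificial task-prompt token (UMAE-style multitask setup).

    The marker names below are hypothetical. During multitask fine-tuning,
    the marker tells a shared encoder-decoder whether to produce an answer
    or an explanation for the same input.
    """
    prompts = {"answer": "<ANSWER>", "explain": "<EXPLAIN>"}
    return f"{prompts[task]} {text}"
```

At inference, the same model can then be steered between answer generation and explanation generation simply by switching the marker.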
Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning
Yue, Yang, Kang, Bingyi, Xu, Zhongwen, Huang, Gao, Yan, Shuicheng
Deep reinforcement learning (RL) algorithms suffer severe performance degradation when interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for boosting sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which differs from how the model is used in RL, namely for value-based planning. Accordingly, the representations learned by these visual methods may be good for recognition but not optimal for estimating state values and solving the decision problem. To address this issue, we propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the ''imagined state'') based on the current one and a sequence of actions. Instead of aligning this imagined state with a real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values. A distance between them is then computed and minimized to force the imagined state to produce an action-value prediction similar to that of the real state. We develop two implementations of this idea for discrete and continuous action spaces respectively. We conduct experiments on the Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness in improving sample efficiency. Our methods achieve new state-of-the-art performance among search-free RL algorithms.
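The core objective the abstract describes can be sketched for the discrete-action case: pass both the imagined and the real latent through a shared Q-head and minimize a distance between the two induced action-value distributions. The abstract does not specify the distance; cross-entropy over softmax-normalized action values is one plausible choice, and the linear Q-head below is an illustrative stand-in for a learned head.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def value_consistency_loss(z_imagined, z_real, q_head):
    """Discrete-action sketch of the VCR objective (not the authors' code).

    Both latents pass through a shared Q-head; the loss is the cross-entropy
    between the two induced action-value distributions, with the real side
    acting as the target (it would be stop-gradiented in practice).
    """
    p_real = softmax(q_head(z_real))
    p_imag = softmax(q_head(z_imagined))
    return float(-(p_real * np.log(p_imag + 1e-8)).sum(axis=-1).mean())
```

By Gibbs' inequality the cross-entropy is minimized exactly when the imagined state reproduces the real state's action-value distribution, which is the consistency the method asks for.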