 chosen


MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

arXiv.org Artificial Intelligence

Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved by using labeled datasets of chosen/rejected pairs and employing optimization algorithms like direct preference optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats, significantly reducing the costs associated with multi-image data annotations. We observe that the attention values of LVLMs vary considerably across different images. We use attention values to identify and filter out rejected responses the model may have mistakenly focused on. Our attention-aware selection constructs the chosen/rejected pairs without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's ability to understand single images. Recent progress in Large Vision-Language Models (LVLMs) marks a significant breakthrough in AI research. While proprietary models (e.g., GPT-4o (OpenAI, 2024)) excel at handling multi-image contexts, current open-source LVLMs (Liu et al., 2024b;a) yield promising results but are primarily focused on single-image visual question answering.
In real-world environments, such as digital documents and web pages, multiple figures and texts are interleaved to convey complex information effectively. The ability to understand multi-image contexts is a crucial direction for the future development of LVLMs. LVLMs typically have three stages: (1) Pre-Training, (2) Supervised Fine-Tuning (SFT), and (3) Preference Alignment (i.e., Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) or from AI Feedback (RLAIF) (Bai et al., 2022)). Pre-training and SFT on multi-image data can enhance the model's ability to handle multiple images to some extent. Nevertheless, similar to single-image scenarios, hallucinations remain an inevitable issue.
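
The abstract above builds on direct preference optimization over chosen/rejected pairs. As a rough illustration only, the following sketch computes the standard per-pair DPO loss from summed response log-probabilities. It does not reproduce MIA-DPO's data construction or attention-aware filtering; the function name, argument names, and the default `beta` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one chosen/rejected pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), evaluated stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In practice the log-probabilities would come from a forward pass of the LVLM over the image(s), prompt, and response; here they are plain floats to keep the sketch self-contained.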


Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis

arXiv.org Artificial Intelligence

Neural metrics for machine translation (MT) evaluation have become increasingly prominent due to their superior correlation with human judgments compared to traditional lexical metrics. Researchers have therefore utilized neural metrics through quality-informed decoding strategies, achieving better results than likelihood-based methods. With the rise of Large Language Models (LLMs), preference-based alignment techniques have gained attention for their potential to enhance translation quality by optimizing model weights directly on preferences induced by quality estimators. This study focuses on Contrastive Preference Optimization (CPO) and conducts extensive experiments to evaluate the impact of preference-based alignment on translation quality. Our findings indicate that while CPO consistently outperforms Supervised Fine-Tuning (SFT) on high-quality data with regard to the alignment metric, it may lead to instability across downstream evaluation metrics, particularly between neural and lexical ones. Additionally, we demonstrate that relying solely on the base model for generating candidate translations achieves performance comparable to using multiple external systems, while ensuring better consistency across downstream metrics.


CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference

arXiv.org Artificial Intelligence

Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework that addresses these challenges and automates ViT deployment on FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: (1) a multi-kernel design that maximizes bandwidth, mainly by exploiting multiple DDR memory banks; (2) approximate non-linear functions that exhibit minimal accuracy degradation and make efficient use of the available logic blocks on the FPGA; and (3) an efficient compiler that maximizes the performance and memory efficiency of the computing kernels through a novel design space exploration algorithm, which searches for the hardware configuration with the best throughput and latency. Compared to state-of-the-art ViT accelerators, CHOSEN achieves 1.5x and 1.42x improvements in throughput on the DeiT-S and DeiT-B models, respectively.


CHOSEN: Contrastive Hypothesis Selection for Multi-View Depth Refinement

arXiv.org Artificial Intelligence

We propose CHOSEN, a simple yet flexible, robust and effective multi-view depth refinement framework. It can be employed in any existing multi-view stereo pipeline, with straightforward generalization capability for different multi-view capture systems such as camera relative positioning and lenses. Given an initial depth estimation, CHOSEN iteratively re-samples and selects the best hypotheses, and automatically adapts to different metric or intrinsic scales determined by the capture system. The key to our approach is the application of contrastive learning in an appropriate solution space and a carefully designed hypothesis feature, based on which positive and negative hypotheses can be effectively distinguished. Integrated in a simple baseline multi-view stereo pipeline, CHOSEN delivers impressive quality in terms of depth and normal accuracy compared to many current deep learning based multi-view stereo pipelines.


Saasable is Chosen to Participate in Startup Accelerator Focused on Accounting Innovation

#artificialintelligence

Tech startup works with AICPA and CPA.com to bring automated data analytics and artificial intelligence solutions to accounting. Saasable has been selected for the 2021 cohort of the accounting-focused startup accelerator sponsored by the Association of International Certified Professional Accountants (AICPA) and CPA.com. As one of five companies chosen for the program, Saasable will be working with industry leaders to solve challenges within the accounting profession. Saasable is a financial reporting and analytics app that allows accountants and SMBs to automate, customize and share daily recurring revenue data with stakeholders in a real-time dashboard. "We're working with the accelerator to learn more about how to sell into the accounting firm space and also to better refine our app. Saasable allows accountants and small businesses to easily track their recurring revenue metrics. The app also helps accountants become valuable advisors to their clients," said Michael Ly, CEO of Saasable.


Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games

arXiv.org Artificial Intelligence

We study reinforcement learning (RL) for text-based games, which are interactive simulations in the context of natural language. While different methods have been developed to represent the environment information and language actions, existing RL agents are not empowered with any reasoning capabilities to deal with textual games. In this work, we aim to conduct explicit reasoning with knowledge graphs for decision making, so that the actions of an agent are generated and supported by an interpretable inference procedure. We propose a stacked hierarchical attention mechanism to construct an explicit representation of the reasoning process by exploiting the structure of the knowledge graph. We extensively evaluate our method on a number of man-made benchmark games, and the experimental results demonstrate that our method performs better than existing text-based agents.


DJAM: distributed Jacobi asynchronous method for learning personal models

arXiv.org Machine Learning

Processing data collected by a network of agents often boils down to solving an optimization problem. The distributed nature of these problems calls for methods that are, themselves, distributed. While most collaborative learning problems require agents to reach a common (or consensus) model, there are situations in which the consensus solution may not be optimal. For instance, agents may want to reach a compromise between agreeing with their neighbors and minimizing a personal loss function. We present DJAM, a Jacobi-like distributed algorithm for learning personalized models. This method is implementation-friendly: it has no hyperparameters that need tuning, it is asynchronous, and its updates only require single-neighbor interactions. We prove that DJAM converges with probability one to the solution, provided that the personal loss functions are strongly convex and have Lipschitz gradient. We then give evidence that DJAM is on par with state-of-the-art methods: our method reaches a solution with error similar to the error of a carefully tuned ADMM in about the same number of single-neighbor interactions.


Amazon Has Chosen This Framework to Guide Deep Learning Strategy

#artificialintelligence

As artificial intelligence advances, the goal for modern tech companies is to build AI software that thinks for itself without human intervention. Towards that end, Amazon Web Services just picked MXNet as its favored deep-learning framework to facilitate that work, according to a blog post Tuesday by Amazon chief technology officer Werner Vogels. Deep learning, as detailed in Fortune earlier this year, is a subset of AI that involves the use of software known as neural networks. Within this realm, software learns by churning through vast reams of data with the help of algorithms--not human programmers--to sort it out. Vogels said AWS will provide software code and documentation, and will invest in the development of MXNet and the ecosystem of companies supporting it.


Compliance-Aware Bandits

arXiv.org Machine Learning

Motivated by clinical trials, we study bandits with observable non-compliance. At each step, the learner chooses an arm, after which, instead of observing only the reward, it also observes the action that actually took place. We show that such non-compliance can be helpful or hurtful to the learner in general. Unfortunately, naively incorporating compliance information into bandit algorithms loses guarantees on sublinear regret. We present hybrid algorithms that maintain regret bounds up to a multiplicative factor and can incorporate compliance information. Simulations based on real data from the International Stroke Trial show the practical potential of these algorithms.
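
To make the setting concrete, the sketch below shows only the bookkeeping that observable non-compliance allows: each reward can be credited both to the arm the learner chose and to the action that actually took place. It is not the hybrid algorithm from the paper; the class name, method names, and the sample data are hypothetical.

```python
from collections import defaultdict


class ComplianceStats:
    """Running mean rewards indexed by chosen arm and by actual action."""

    def __init__(self):
        self.count = {"chosen": defaultdict(int), "actual": defaultdict(int)}
        self.mean = {"chosen": defaultdict(float), "actual": defaultdict(float)}

    def update(self, chosen_arm, actual_arm, reward):
        # Credit the reward to both views of the same step.
        for view, arm in (("chosen", chosen_arm), ("actual", actual_arm)):
            self.count[view][arm] += 1
            n = self.count[view][arm]
            # Incremental mean update.
            self.mean[view][arm] += (reward - self.mean[view][arm]) / n


stats = ComplianceStats()
# Hypothetical steps: (arm the learner chose, action that took place, reward).
for chosen, actual, reward in [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 1.0)]:
    stats.update(chosen, actual, reward)
```

A learner could base its arm selection on either set of estimates; as the abstract notes, naively using the compliance-based view can forfeit sublinear regret guarantees, which is what motivates the hybrid algorithms.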