dvd


Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation

Wang, Jing, Bae, Wonho, Chen, Jiahong, Wang, Wenxu, Noh, Junhyug

arXiv.org Artificial Intelligence

Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature's label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature's latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier's accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve.
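The vicinity-sampling and alignment steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature bank, the Gaussian fit over k nearest neighbors, and the InfoNCE form are simplified, and the frozen diffusion module is omitted (the sampled vicinity point stands in for its output).

```python
import numpy as np

def vicinity_sample(feat, bank, k=5, rng=None):
    """Sample from a Gaussian fit over a feature's k nearest neighbors."""
    rng = rng or np.random.default_rng(0)
    d = np.linalg.norm(bank - feat, axis=1)
    nn = bank[np.argsort(d)[:k]]
    mu, sigma = nn.mean(axis=0), nn.std(axis=0) + 1e-6
    return rng.normal(mu, sigma)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE: pull the anchor toward the positive, away from negatives."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

In the full method, the positive would be the source-like cue produced by the frozen diffusion module from the sampled vicinity point, and the loss would update only the target encoder.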


Review for NeurIPS paper: Effective Diversity in Population Based Reinforcement Learning

Neural Information Processing Systems

Weaknesses: The paper needs improvement on a few important issues, as detailed below. Why is it important to enhance population-wide behavioral diversity? Intuitively, I can see the potential benefits for deep exploration and learning stability. Theoretically, however, I cannot straightforwardly link these benefits to the proposed use of the kernel function and the kernel matrix determinant. Theorem 3.3 states that when lambda is set properly, the population will contain M distinct optimal policies.
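For context, the determinant-based diversity measure the review questions can be illustrated with a small sketch; the RBF kernel and the notion of a policy "embedding" here are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def rbf_kernel_matrix(embeddings, gamma=1.0):
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||e_i - e_j||^2)."""
    sq = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def population_diversity(embeddings, gamma=1.0):
    """det(K) lies in [0, 1]: near 1 when all members behave distinctly,
    and collapses to 0 when any two members behave almost identically."""
    return float(np.linalg.det(rbf_kernel_matrix(embeddings, gamma)))
```

The determinant shrinks whenever two rows of the kernel matrix become similar, which is why maximizing it pushes the population apart in behavior space.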


Markov Chain of Thought for Efficient Mathematical Reasoning

Yang, Wen, Fan, Kai, Liao, Minpeng

arXiv.org Artificial Intelligence

Multi-step Chain of Thought (CoT) benefits from the logical structure of reasoning steps and task-specific actions, significantly enhancing the mathematical reasoning capabilities of large language models. With the growing prevalence of long CoT, however, the number of reasoning steps can exceed manageable token limits, leading to higher computational demands. Inspired by the fundamental logic of human cognition, ``derive, then reduce'', we conceptualize standard multi-step CoT as a novel Markov Chain of Thought (MCoT). In this study, we consider the mathematical reasoning task, defining each reasoning step as text accompanied by a Python code snippet. To facilitate a longer reasoning path, self-correction is enabled through interactions with the code interpreter. MCoT compresses the previous reasoning steps into a simplified question, enabling efficient next-step inference without relying on a lengthy KV cache. In our experiments, we curate the \texttt{MCoTInstruct} dataset, and the empirical results indicate that MCoT not only significantly enhances efficiency but also maintains comparable accuracy. While much remains to be explored, this work paves the way for studying the long CoT reasoning abilities of LLMs.
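The "derive, then reduce" loop can be sketched abstractly. Here `derive`, `reduce_fn`, and `is_final` are hypothetical placeholders (in the paper these would be LLM calls and a code interpreter), chosen only to show how each step conditions on the current simplified question rather than on the full history.

```python
def markov_cot(question, derive, reduce_fn, is_final, max_steps=10):
    """Markov Chain of Thought: each step derives a reasoning step from the
    current state only, then reduces (state, step) into a new, simpler
    question -- no growing transcript or KV cache is carried along."""
    state = question
    for _ in range(max_steps):
        step = derive(state)            # one reasoning step (text + code)
        if is_final(step):
            return step                 # answer reached
        state = reduce_fn(state, step)  # compress into a simplified question
    return state

# Toy instance: "count down to zero", where each derived step becomes
# the next, simpler state.
result = markov_cot(3,
                    derive=lambda s: s - 1,
                    reduce_fn=lambda s, step: step,
                    is_final=lambda step: step == 0)
```

The key property is that memory usage stays constant in the number of steps, because only the current state is ever passed to `derive`.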


PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation

Luo, Jing, Luo, Run, Chen, Longze, Zhu, Liang, Ao, Chang, Li, Jiaming, Chen, Yukun, Cheng, Xin, Yang, Wen, Su, Jiayuan, Li, Chengming, Yang, Min

arXiv.org Artificial Intelligence

While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models continue to struggle with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage is learning from Persona Diversification, and the second is learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a novel persona-driven data augmentation technique to enhance the dataset's quantity and diversity. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on LLaMA-2-7B) achieves an accuracy of 24.2% on MATH and 68.7% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 70.3K data points, merely 17.8% of MetaMathQA and 27% of MathInstruct, yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training.
"There are a thousand Hamlets in a thousand people's eyes." Among LLM reasoning tasks, solving math problems stands out as particularly challenging due to its complexity and the multi-step reasoning required to reach a solution. While some closed-source models, such as GPT-4o (OpenAI, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024), have demonstrated strong math-solving capabilities, current open-source models (e.g., LLaMA (Touvron et al., 2023; Dubey et al., 2024)) continue to struggle in this area. Enhancing the math problem-solving abilities of open-source models therefore remains a pressing goal.
A widely adopted and effective approach for improving the math-solving capabilities of open-source models is fine-tuning, owing to the accessibility of their weights (Yuan et al., 2023; Yue et al., 2023). Our method consists of two stages: Stage 1 uses closed-source LLMs to automatically generate detailed CoT solutions and applies our persona-driven rewriting method to rephrase the questions; Stage 2 incorporates reflection.


VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation

Qian, Kun, Wan, Shunji, Tang, Claudia, Wang, Youzhi, Zhang, Xuanming, Chen, Maximillian, Yu, Zhou

arXiv.org Artificial Intelligence

As large language models achieve impressive scores on traditional benchmarks, an increasing number of researchers are becoming concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test set labels closed-source. They require anyone wishing to evaluate their language model to submit the model's predictions for centralized processing and then publish the results on their leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time. We applied this variable perturbation method to four datasets: GSM8K, ARC, CommonsenseQA, and TruthfulQA, which cover mathematical generation and multiple-choice tasks. Our experimental results demonstrate that this approach provides a more accurate assessment of the true capabilities of language models, effectively mitigating the contamination problem.
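A minimal sketch of the variabilization idea, assuming a templated GSM8K-style item; the function names and the template are illustrative, not VarBench's actual API. The point is that the gold answer is recomputed from the sampled values, so each evaluation sees a fresh test case that a contaminated model cannot have memorized.

```python
import random

def variabilize(template, var_ranges, answer_fn, seed=None):
    """Sample fresh variable values to instantiate a unique test case,
    recomputing the gold answer so evaluation stays fair."""
    rng = random.Random(seed)
    values = {name: rng.choice(choices) for name, choices in var_ranges.items()}
    return template.format(**values), answer_fn(**values)

# A GSM8K-style arithmetic item with two perturbable variables.
template = "Alice has {a} apples and buys {b} more. How many does she have?"
question, answer = variabilize(
    template,
    var_ranges={"a": range(2, 10), "b": range(1, 6)},
    answer_fn=lambda a, b: a + b,
    seed=7,
)
```

Re-running without a fixed seed yields a different instance each time, which is what makes the benchmark dynamic.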


How disruption can lead to innovation in how you conduct your business

#artificialintelligence

When Marc Randolph and Reed Hastings launched the world's first online DVD-rental service in California in 1997, it was far from the behemoth that Netflix is today. Customers had fewer than 1,000 titles to choose from, and they had to pay separately for every DVD they rented. The founders nearly sold the company to Amazon for around US$15 million, and when they tried to sell it to Blockbuster for US$50 million a few years later, Blockbuster's CEO reportedly told Netflix, "The dotcom hysteria is completely overblown." Fast forward two decades, and the company now has more than 200 million subscribers around the world and more than US$20 billion in annual revenue. How did it get there? The simple answer is 'reinvention', a requirement that can present itself to any business, but one that not all business leaders have the courage or forethought to embrace.


A Brief Introduction to Recommendation Systems

#artificialintelligence

Have you ever wondered how apps like Netflix or Spotify decide which movies or songs you're likely to prefer watching or listening to? It seems like magic, doesn't it? Behind the scenes, a lot of data is mined and multiple complex algorithms are developed by data science professionals to make these predictions more accurate. It is not magic but "machine learning." Machine learning is what allows the system to determine the movies and songs most relevant to your taste.
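To make the idea concrete, here is a toy user-based recommender built on cosine similarity over a small ratings matrix. The data and the "most similar user" heuristic are illustrative only, and far simpler than what Netflix or Spotify actually deploy.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recommend(target, others, catalog):
    """Recommend titles the most similar user rated highly (>= 4)
    that the target user hasn't rated yet (0 means 'not rated')."""
    best = max(others, key=lambda other: cosine(target, other))
    return [catalog[i] for i, (t, b) in enumerate(zip(target, best))
            if t == 0 and b >= 4]

catalog = ["Stranger Things", "The Crown", "Dark"]
target = [5, 0, 0]                     # our user loved the first title
others = [[5, 4, 1], [1, 2, 5]]        # two other users' ratings
picks = recommend(target, others, catalog)
```

Here the first "other" user has tastes closest to the target, so the target is shown what that user liked but hasn't seen yet.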


Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos

Chen, Annie S., Nair, Suraj, Finn, Chelsea

arXiv.org Artificial Intelligence

We are motivated by the goal of generalist robots that can complete a wide range of tasks across many environments. Critical to this is the robot's ability to acquire some metric of task success or reward, which is necessary for reinforcement learning, planning, or knowing when to ask for help. For a general-purpose robot operating in the real world, this reward function must also be able to generalize broadly across environments, tasks, and objects, while depending only on on-board sensor observations (e.g. RGB images). While deep learning on large and diverse datasets has shown promise as a path towards such generalization in computer vision and natural language, collecting high quality datasets of robotic interaction at scale remains an open challenge. In contrast, "in-the-wild" videos of humans (e.g. YouTube) contain an extensive collection of people doing interesting tasks across a diverse range of settings. In this work, we propose a simple approach, Domain-agnostic Video Discriminator (DVD), that learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task, and can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos. We find that by leveraging diverse human datasets, this reward function (a) can generalize zero-shot to unseen environments, (b) can generalize zero-shot to unseen tasks, and (c) can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
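The reward computation can be sketched as follows. `toy_discriminator` is a stand-in for DVD's learned video classifier (operating here on plain vectors rather than video embeddings), used only to show how same-task discriminator scores against human demos become a scalar task reward.

```python
import numpy as np

def toy_discriminator(v1, v2):
    """Stand-in for a learned same-task classifier: closer 'video'
    embeddings yield a higher probability of being the same task."""
    dist = np.linalg.norm(np.asarray(v1, float) - np.asarray(v2, float))
    return 1.0 / (1.0 + np.exp(dist - 1.0))

def task_reward(score_same_task, robot_video, human_demos):
    """DVD-style reward: average probability, under the discriminator,
    that the robot's video matches the human demonstration videos."""
    return float(np.mean([score_same_task(robot_video, d)
                          for d in human_demos]))
```

In the paper, this score is what visual model predictive control maximizes when selecting action sequences for the robot.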


DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

Le, Hung, Sankar, Chinnadhurai, Moon, Seungwhan, Beirami, Ahmad, Geramifard, Alborz, Kottur, Satwik

arXiv.org Artificial Intelligence

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem involving complex multimodal and temporal inputs, and studying them independently is hard with existing datasets. Existing benchmarks do not have enough annotations to help analyze dialogue systems and understand their linguistic and visual reasoning capability and limitations in isolation. These benchmarks are also not explicitly designed to minimize biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning each question requires, including cross-turn video interval tracking and dialogue object tracking. We use our dataset to analyze several dialogue system approaches, providing interesting insights into their abilities and limitations. In total, the dataset contains $10$ instances of $10$-round dialogues for each of $\sim11k$ synthetic videos, resulting in more than $100k$ dialogues and $1M$ question-answer pairs. Our code and dataset will be made public.


DREAM.ac: Build Teams Using Artificial Intelligence

#artificialintelligence

Artificial Intelligence is being deployed to address many human problems; most recently, Google's Duplex can make reservations on your behalf by talking to a human. We have had some really interesting clients in the Human Resources (HR) space, a field dominated by human interaction. The main question we see from clients is how to use A.I. for headhunting or for matching candidates to roles. Today I want to walk you through the solution architecture for one of our clients in the HR space and give you a sense of how A.I. can be deployed to automate and improve HR processes. The motivating problem is simple: 68% of projects fail and 90% of startups fail.