Single-Call Stochastic Extragradient Methods for Structured Non-monotone Variational Inequalities: Improved Analysis under Weaker Conditions

Neural Information Processing Systems

Single-call stochastic extragradient methods, like stochastic past extragradient (SPEG) and stochastic optimistic gradient (SOG), have gained a lot of interest in recent years and are among the most efficient algorithms for solving large-scale min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, despite their undoubted popularity, current convergence analyses of SPEG and SOG require strong assumptions like bounded variance or growth conditions. In addition, several important questions regarding the convergence properties of these methods are still open, including mini-batching, efficient step-size selection, and convergence guarantees under different sampling strategies. In this work, we address these questions and provide convergence guarantees for two large classes of structured non-monotone VIPs: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems) and (ii) weak Minty variational inequalities (a generalization of monotone and Minty VIPs). We introduce the expected residual condition, explain its benefits, and show how it gives a strictly weaker condition than previously used growth conditions, expected co-coercivity, or bounded variance assumptions. Finally, our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.
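To make the "single-call" idea concrete, here is a minimal sketch of the (deterministic) past extragradient update on a toy strongly monotone operator F(z) = A z. The operator, step size, and iteration count are illustrative choices, not the paper's setup; the point is that each iteration reuses the previous operator evaluation and makes only one fresh call to F.

```python
import numpy as np

# Toy strongly monotone operator F(z) = A z (symmetric part of A is the
# identity, so F is strongly monotone with modulus 1).
A = np.array([[1.0, 1.0],
              [-1.0, 1.0]])
F = lambda z: A @ z

gamma = 0.3                 # illustrative step size
z = np.array([5.0, -3.0])   # arbitrary starting point
g_prev = F(z)               # previous operator evaluation, reused each step

for _ in range(200):
    z_half = z - gamma * g_prev   # extrapolation using the *past* evaluation
    g_prev = F(z_half)            # the single fresh operator call per iteration
    z = z - gamma * g_prev        # update step

print(np.linalg.norm(z))  # approaches 0, the solution of F(z*) = 0
```

By contrast, the classical extragradient method would call F twice per iteration (once at z and once at z_half); reusing g_prev halves the oracle cost, which is what makes these methods attractive at scale.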


Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise

Neural Information Processing Systems

Moreover, the only known high-probability complexity results have been derived under restrictive sub-Gaussian (light-tailed) noise and bounded domain assumption [Juditsky et al., 2011a].


Appendix A Proof of Theoretical results

Neural Information Processing Systems

A.1 Proof of Propositions 1 and 3. To prove Proposition 1, we first need the following lemma; readers may refer to [47] for its proof. Consider first the left-hand side: the first inequality follows from the information processing inequality. The compactness assumption in Proposition 2 may seem restrictive, since BNNs with Gaussian priors on weights break it. Indeed, the assumptions in Proposition 2 are merely sufficient conditions. In this section, we discuss the non-parametric counterpart of Proposition 2, i.e., is the grid functional KL between a parametric model and a Gaussian process still finite?




Spatio-temporal Multivariate Time Series Forecast with Chosen Variables

Liu, Zibo, Jiang, Zhe, Xu, Zelin, Xiao, Tingsong, Zhang, Yupu, Xiao, Zhengkun, Wang, Haibo, Chen, Shigang

arXiv.org Artificial Intelligence

Spatio-Temporal Multivariate time series Forecast (STMF) uses the time series of $n$ spatially distributed variables in a period of recent past to forecast their values in a period of near future. It has important applications in spatio-temporal sensing forecast such as road traffic prediction and air pollution prediction. Recent papers have addressed a practical problem of missing variables in the model input, which arises in the sensing applications where the number $m$ of sensors is far less than the number $n$ of locations to be monitored, due to budget constraints. We observe that the state of the art assumes that the $m$ variables (i.e., locations with sensors) in the model input are pre-determined and the important problem of how to choose the $m$ variables in the input has never been studied. This paper fills the gap by studying a new problem of STMF with chosen variables, which optimally selects $m$-out-of-$n$ variables for the model input in order to maximize the forecast accuracy. We propose a unified framework that jointly performs variable selection and model optimization for both forecast accuracy and model efficiency. It consists of three novel technical components: (1) masked variable-parameter pruning, which progressively prunes less informative variables and attention parameters through quantile-based masking; (2) prioritized variable-parameter replay, which replays low-loss past samples to preserve learned knowledge for model stability; (3) dynamic extrapolation mechanism, which propagates information from variables selected for the input to all other variables via learnable spatial embeddings and adjacency information. Experiments on five real-world datasets show that our work significantly outperforms the state-of-the-art baselines in both accuracy and efficiency, demonstrating the effectiveness of joint variable selection and model optimization.
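The quantile-based masking in component (1) can be sketched as follows. The importance scores, keep fraction, and thresholding rule below are hypothetical stand-ins, since the abstract does not specify the exact pruning criterion; the sketch only illustrates selecting the top variables by a quantile threshold in one pruning round.

```python
import numpy as np

def quantile_mask_prune(scores, keep_frac):
    """Keep variables whose importance score reaches the (1 - keep_frac)
    quantile; everything below the threshold is masked out.  An assumed
    simplification of the paper's masked variable-parameter pruning."""
    thresh = np.quantile(scores, 1.0 - keep_frac)
    return scores >= thresh  # boolean mask over the n candidate variables

n = 10                                  # hypothetical number of locations
rng = np.random.default_rng(0)
scores = rng.random(n)                  # stand-in importance scores
mask = quantile_mask_prune(scores, keep_frac=0.3)
print(mask.sum())  # about 0.3 * n variables survive this pruning round
```

In a progressive scheme, keep_frac would shrink over training rounds until only m-out-of-n variables remain, with attention parameters pruned alongside them.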


Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Long, Lin, Oh, Changdae, Park, Seongheon, Li, Sharon

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) achieve strong performance on multi-modal tasks, yet they often default to their language prior (LP)--memorized textual patterns from pre-training--while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of the language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that the VIP consistently emerges and that TVI reliably predicts the strength of the language prior. Modern LVLMs (OpenAI, 2025; Comanici et al., 2025; Bai et al., 2025; Zhu et al., 2025) have extended the boundaries of AI applications at an unprecedented rate. Their remarkable capability in solving highly complex vision-language tasks originates from the rich unimodal knowledge internalized during pre-training (Radford et al., 2021; Oquab et al., 2024; Brown et al., 2020) and from strong multimodal alignment (Liu et al., 2023; Dai et al., 2023; Zhu et al., 2024).
Despite their successes, a central challenge remains: LVLMs are prone to over-relying on their language prior (LP)--the statistical patterns memorized during large-scale language model pretraining--while under-utilizing the actual visual evidence (Fu et al., 2024; Lee et al., 2025; Luo et al., 2025). This imbalance often results in hallucinations, shortcut reasoning, and brittle generalization when tasks truly demand visual grounding.
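A TVI-style computation can be sketched as follows. The L2 distance, the synthetic embeddings, and the "first divergent layer" rule for locating the VIP are assumptions for illustration, not the paper's exact estimator; the sketch only shows aggregating per-layer distance between a visually-conditioned and a text-only chain of embeddings from the VIP onward.

```python
import numpy as np

def tvi(h_visual, h_text, vip_layer):
    """Aggregate per-layer embedding distance from the VIP onward.
    h_visual, h_text: (L, d) chains of embeddings with/without the image."""
    dists = np.linalg.norm(h_visual - h_text, axis=1)  # one distance per layer
    return dists[vip_layer:].sum()

L, d = 12, 16
rng = np.random.default_rng(1)
h_text = rng.normal(size=(L, d))   # text-only chain of embeddings (synthetic)
h_visual = h_text.copy()
h_visual[6:] += 1.0                # representations diverge from layer 6 on

# Locate the VIP as the first layer where the chains diverge (an assumed rule).
dists = np.linalg.norm(h_visual - h_text, axis=1)
vip = int(np.argmax(dists > 0))
score = tvi(h_visual, h_text, vip_layer=vip)
print(vip, score)  # VIP at layer 6; TVI sums the post-VIP distances
```

A small TVI under this reading would indicate that visual input barely reshapes the representations, i.e., the model is answering mostly from its language prior.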



NeoARCADE: Robust Calibration for Distance Estimation to Support Assistive Drones for the Visually Impaired

Raj, Suman, Madhabhavi, Bhavani A, Kumar, Madhav, Gupta, Prabhav, Simmhan, Yogesh

arXiv.org Artificial Intelligence

Autonomous navigation by drones using onboard sensors, combined with deep learning and computer vision algorithms, is impacting a number of domains. We examine the use of drones to autonomously follow and assist Visually Impaired People (VIPs) in navigating urban environments. Estimating the absolute distance between the drone and the VIP, and to nearby objects, is essential for designing obstacle avoidance algorithms. Here, we present NeoARCADE (Neo), which uses depth maps over monocular video feeds, common in consumer drones, to estimate absolute distances to the VIP and obstacles. Neo proposes a robust calibration technique based on depth-score normalization and coefficient estimation to translate relative distances from the depth map into absolute ones. It further develops a dynamic recalibration method that can adapt to changing scenarios. We also develop two baseline models, Regression and Geometric, and compare Neo with SOTA depth map approaches and the baselines. We provide detailed evaluations to validate their robustness and generalizability for distance estimation to VIPs and other obstacles in diverse and dynamic conditions, using datasets collected in a campus environment. Neo predicts distances to the VIP with an error under 30 cm, and to different obstacles such as cars and bicycles within a maximum error of 60 cm, outperforming the baselines. Neo also clearly outperforms SOTA depth map methods, reporting errors 5.3-14.6x lower.
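The calibration idea, normalizing relative depth scores and fitting coefficients that map them to absolute distances, can be sketched as below. The linear model, min-max normalization, and the sample values are assumed simplifications for illustration; Neo's actual coefficient estimation and recalibration procedure may differ.

```python
import numpy as np

def calibrate(depth_scores, true_dists):
    """Fit coefficients (a, b) mapping normalized monocular depth scores to
    absolute distance: dist ~= a * s_norm + b.  A hypothetical linear stand-in
    for Neo's calibration step."""
    s = (depth_scores - depth_scores.min()) / (depth_scores.max() - depth_scores.min())
    a, b = np.polyfit(s, true_dists, deg=1)  # least-squares line fit
    return a, b, s

# Hypothetical calibration targets: relative depth scores from the depth map
# paired with measured absolute distances (metres).
scores = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
dists  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

a, b, s = calibrate(scores, dists)
pred = a * s + b   # calibrated absolute-distance estimates
```

Dynamic recalibration would then amount to re-fitting (a, b) whenever the scene or lighting changes enough that the current coefficients drift from ground truth.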