AITopics

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems, Nov-20-2025, 09:36:35 GMT

A Derivation of the Score Function Estimator. Given $K$ samples, the objective being maximized is $\mathcal{L}_K(x) := \mathbb{E}\left[\log \hat{Z}\right]$, with $\hat{Z} := \frac{1}{K} \cdots$

... $K \to \infty$ and get the asymptotic variance: $\mathrm{Var}[g] = \mathrm{Var}\left[\sum \cdots\right]$ ... D.2 Control Variate for Small ESS. In the case $\mathrm{ESS} \approx 1$ we can write $\log \hat{Z}$ as a sum of two terms: $\log \hat{Z} = \log w \cdots$ ... We will leave out a derivation for non-leading terms for brevity. ... D.3 Unified Interpolation. We unify the two ESS limits under a unifying expression OVIS ... In this paper, gradient ascent is considered (i.e. ... However, the term (a) may dominate the term (b). ... The model is trained for 5,000 epochs using the Adam optimizer with a base learning ...
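The excerpt above refers to $\log \hat{Z}$ and to the effective sample size (ESS) without giving full definitions. The following is a minimal sketch, assuming that $\hat{Z}$ is the average of $K$ importance weights $w_k$ and that ESS is the usual $(\sum_k w_k)^2 / \sum_k w_k^2$; both definitions, the function names, and the example log-weights are assumptions for illustration and are not taken from the listing.

```python
import math


def log_mean_exp(log_w):
    """Return log((1/K) * sum_k exp(log_w[k])), computed stably."""
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / len(log_w))


def effective_sample_size(log_w):
    """Return (sum w)^2 / sum w^2, which lies between 1 and K."""
    m = max(log_w)
    # Rescaling every weight by exp(-m) leaves the ratio unchanged.
    w = [math.exp(lw - m) for lw in log_w]
    return sum(w) ** 2 / sum(x * x for x in w)


# Hypothetical log-weights for K = 4 samples.
balanced = [0.0, 0.0, 0.0, 0.0]         # all samples contribute equally
dominated = [0.0, -50.0, -50.0, -50.0]  # one sample dominates

print(log_mean_exp(balanced))
print(effective_sample_size(balanced))
print(effective_sample_size(dominated))
```

With equal weights the ESS equals $K$; when a single weight dominates it approaches 1, which is the "small ESS" regime the excerpt mentions.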

estimator, gradient, inference network, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Neural Information Processing Systems, Nov-15-2025, 06:08:36 GMT

Optimal Variance Control of the Score Function Gradient Estimator for Importance Weighted Bounds

Valentin Liévin, Andrea Dittadi, Anders Christensen

We will compare the different estimators on the basis of their signal-to-noise ratio.
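As a rough illustration of such a comparison, the sketch below estimates a signal-to-noise ratio from repeated gradient samples. It assumes the common definition $\mathrm{SNR} = |\mathbb{E}[g]| / \sqrt{\mathrm{Var}[g]}$ for a single gradient component; the listing does not state the formula, and the simulated "estimators" are hypothetical Gaussian stand-ins, not the estimators from the paper.

```python
import random
import statistics


def snr(samples):
    """Empirical |mean| / standard deviation of one gradient component."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return abs(mean) / std


rng = random.Random(0)
true_grad = 1.0  # hypothetical true gradient value

# Two hypothetical unbiased estimators that differ only in their noise level.
low_noise = [true_grad + rng.gauss(0.0, 0.1) for _ in range(2000)]
high_noise = [true_grad + rng.gauss(0.0, 10.0) for _ in range(2000)]

print("low-noise estimator SNR: ", snr(low_noise))
print("high-noise estimator SNR:", snr(high_noise))
```

Both stand-ins have the same expected value, so the difference in SNR comes entirely from variance, which is the quantity a control variate is meant to reduce.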

arxiv preprint arxiv, control variate, estimator, (11 more...)

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Aug-18-2025, 22:30:19 GMT

MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP.

artificial intelligence, machine learning, query, (18 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems, Aug-16-2025, 05:16:29 GMT

A Derivation of the Score Function Estimator. Given $K$ samples, the objective being maximized is $\mathcal{L}_K(x) := \mathbb{E}\left[\log \hat{Z}\right]$, with $\hat{Z} := \frac{1}{K} \cdots$

... $K \to \infty$ and get the asymptotic variance: $\mathrm{Var}[g] = \mathrm{Var}\left[\sum \cdots\right]$ ... D.2 Control Variate for Small ESS. In the case $\mathrm{ESS} \approx 1$ we can write $\log \hat{Z}$ as a sum of two terms: $\log \hat{Z} = \log w \cdots$ ... We will leave out a derivation for non-leading terms for brevity. ... D.3 Unified Interpolation. We unify the two ESS limits under a unifying expression OVIS ... In this paper, gradient ascent is considered (i.e. ... However, the term (a) may dominate the term (b). ... The model is trained for 5,000 epochs using the Adam optimizer with a base learning ...

estimator, gradient, inference network, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Neural Information Processing Systems, Aug-16-2025, 05:16:22 GMT

c15203a83f778ce8934d0efaf2d5c6f3-Paper.pdf

arxiv preprint arxiv, control variate, estimator, (11 more...)

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence, Jun-17-2024

Ovis: Structural Embedding Alignment for Multimodal Large Language Model

Lu, Shiyin, Li, Yang, Chen, Qing-Guo, Xu, Zhao, Luo, Weihua, Zhang, Kaifu, Ye, Han-Jia

Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with a pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between the two embedding strategies in MLLMs (the structural textual embeddings based on an embedding look-up table, and the continuous embeddings generated directly by the vision encoder) poses challenges for a seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
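The "probabilistic combination of the indexed embeddings" can be pictured with a small sketch. It assumes that each patch produces logits over a visual vocabulary, that a softmax turns them into probabilities, and that the patch embedding is the probability-weighted sum of the table rows. The softmax, the function names, and the toy table are illustrative assumptions; the abstract does not specify these details, and this is not the Ovis implementation.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(patch_logits, table):
    """Probability-weighted sum of the rows of a visual embedding table."""
    probs = softmax(patch_logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Hypothetical 3-entry visual vocabulary with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical logits for one patch: equal mass on entries 0 and 1,
# effectively none on entry 2.
patch_logits = [2.0, 2.0, -1e9]

print(visual_embedding(patch_logits, table))
```

A textual token corresponds to the special case in which all probability mass sits on a single table row, which is one way to read the abstract's claim that the visual path mirrors the textual one.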

arxiv, ovis, zhang, (15 more...)

2405.20797

Country:

Europe (0.14)
Asia > China > Jiangsu Province > Nanjing (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Mexico (0.04)

Genre: Research Report (0.64)

Industry:

Education (0.48)
Health & Medicine (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Jul-24-2023

CTVIS: Consistent Training for Online Video Instance Segmentation

Ying, Kaining, Zhong, Qing, Mao, Weian, Wang, Zhenhua, Chen, Hao, Wu, Lin Yuanbo, Liu, Yifan, Fan, Chengxiang, Zhuge, Yunzhi, Shen, Chunhua

The discriminability of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon the contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To this end, we propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which is devoted to aligning the training and inference pipelines in terms of building CIs. Specifically, CTVIS constructs CIs by referring to the momentum-averaged embedding and memory bank storage mechanisms used at inference, and by adding noise to the relevant embeddings. Such an extension allows a reliable comparison between embeddings of current instances and the stable representations of historical instances, thereby conferring an advantage in modeling VIS challenges such as occlusion, re-identification, and deformation. Empirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three VIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS (35.5% AP). Furthermore, we find that pseudo-videos transformed from images can train robust models surpassing fully-supervised ones.
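The abstract mentions a momentum-averaged embedding and a memory bank without giving formulas. The sketch below assumes an exponential moving average for the momentum embedding and a fixed-length queue for the memory bank. The class name, the `beta` value, and the bank length are hypothetical, and the noise injection mentioned in the abstract is omitted.

```python
from collections import deque


def momentum_update(prev, new, beta=0.9):
    """Exponential moving average of an instance embedding (assumed form)."""
    return [beta * p + (1.0 - beta) * n for p, n in zip(prev, new)]


class MemoryBank:
    """Keeps recent embeddings of one instance plus their momentum average."""

    def __init__(self, max_len=3):
        self.history = deque(maxlen=max_len)  # oldest entries drop out
        self.momentum = None

    def add(self, embedding, beta=0.9):
        self.history.append(list(embedding))
        if self.momentum is None:
            self.momentum = list(embedding)
        else:
            self.momentum = momentum_update(self.momentum, embedding, beta)


# Hypothetical 2-dimensional embeddings of one instance over three frames.
bank = MemoryBank(max_len=2)
for frame_embedding in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    bank.add(frame_embedding, beta=0.5)

print(bank.momentum)       # smoothed representation of the instance
print(list(bank.history))  # only the two most recent embeddings remain
```

The smoothed vector changes slowly from frame to frame, which is the kind of stable historical representation the abstract contrasts with single-reference-frame comparisons.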

artificial intelligence, ctvis, machine learning, (18 more...)

2307.12616

Country:

Asia > China (0.04)
Oceania > Australia > South Australia > Adelaide (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Sensing and Signal Processing (0.68)

arXiv.org Artificial Intelligence, Aug-3-2022

MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

Huang, De-An, Yu, Zhiding, Anandkumar, Anima

We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following inference pipeline: we first apply the trained query-based image instance segmentation to video frames independently. The segmented instances are then tracked by bipartite matching of the corresponding queries. This inference is done in an online fashion and does not need to process the whole video at once. MinVIS thus has the practical advantages of reducing both the labeling costs and the memory requirements, while not sacrificing the VIS performance. Code is available at: https://github.com/NVlabs/MinVIS
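The tracking step described above, bipartite matching of queries between consecutive frames, can be sketched as follows. It assumes cosine similarity as the matching score and an equal number of queries per frame, and it uses an exhaustive search over permutations that is only practical for a handful of queries. The function names and the toy query vectors are hypothetical; this is not the MinVIS code, and a real system would more likely use the Hungarian algorithm.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_queries(prev, curr):
    """Return a list m where curr[m[i]] is matched to prev[i].

    Exhaustively picks the one-to-one assignment that maximizes the total
    cosine similarity (brute force, for illustration only).
    """
    n = len(prev)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(cosine(prev[i], curr[perm[i]]) for i in range(n)),
    )
    return list(best)


# Hypothetical query embeddings for three instances in two consecutive frames.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[0.1, 0.9], [0.9, 0.8], [1.0, 0.1]]

print(match_queries(prev_frame, curr_frame))
```

Because only the previous and current frame are needed, the matching can run online, which is consistent with the abstract's note that the whole video does not have to be processed at once.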

artificial intelligence, machine learning, social media, (20 more...)

2208.02245

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Communications > Social Media (0.66)

arXiv.org Artificial Intelligence, Aug-8-2021

OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

Liu, Sheng, Lin, Kevin, Wang, Lijuan, Yuan, Junsong, Liu, Zicheng

We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances, i.e., image patches, that satisfy the search intent from an image database. The term "open vocabulary" means that there are restrictions neither on the visual instances to be searched nor on the words that can be used to compose the textual search query. We propose to address this search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages massive image-caption pairs as weak image-level (not instance-level) supervision to learn a rich cross-modal semantic space where the representations of visual instances (not images) and those of textual queries are aligned, thus allowing us to measure the similarities between any visual instance and an arbitrary textual query. To evaluate the performance of ViSA, we build two datasets named OVIS40 and OVIS1600 and also introduce a pipeline for error analysis. Through extensive experiments on the two datasets, we demonstrate ViSA's ability to search for visual instances in images not available during training given a wide range of textual queries, including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 21.9% on OVIS40 under the most challenging setting and an mAP@6 of 14.9% on the OVIS1600 dataset.
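Once queries and visual instances live in a shared embedding space, retrieval reduces to ranking instances by similarity to the query. The sketch below assumes cosine similarity as the measure and uses made-up 2-dimensional vectors; the patch names, vectors, and function names are hypothetical, and the learned ViSA encoders are not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_instances(query_vec, instance_vecs):
    """Return instance names sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in instance_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical embedding of a textual query and of three image patches.
query = [1.0, 0.0]
patches = {
    "patch_a": [0.9, 0.1],
    "patch_b": [0.0, 1.0],
    "patch_c": [0.5, 0.5],
}

print(rank_instances(query, patches))
```

A metric such as mAP@k would then be computed over ranked lists like this one, given ground-truth relevance labels for each query.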

query, representation, similarity, (17 more...)

2108.03704

Genre: Research Report (0.70)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)