ovis
A Derivation of the Score Function Estimator. Given $K$ samples, the objective being maximized is $\mathcal{L}_K(x) := \mathbb{E}[\log \hat{Z}]$, with $\hat{Z} := \frac{1}{K} \cdots$
... $K \to \infty$ and get the asymptotic variance: $\mathrm{Var}[g] = \mathrm{Var}[\,\cdots\,]$ ... D.2 Control Variate for Small ESS. In the case ESS $\approx 1$ we can write $\log \hat{Z}$ as a sum of two terms: $\log \hat{Z} = \log w \cdots$ ... We will leave out a derivation for non-leading terms for brevity. ... D.3 Unified Interpolation. We unify the two ESS limits under a unifying expression OVIS ... In this paper, gradient ascent is considered (i.e. ... However, the term (a) may dominate the term (b). ... The model is trained for 5.000 epochs using the Adam optimizer with a base learning ...
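The snippet above truncates the definition of $\hat{Z}$ and never defines ESS, so the following is only a sketch under two stated assumptions: that $\hat{Z}$ is the usual average of $K$ importance weights, and that ESS is the standard effective sample size $(\sum_k w_k)^2 / \sum_k w_k^2$. The function names are hypothetical and are not taken from the paper.

```python
import math

def log_z_hat(log_weights):
    """log of (1/K) * sum_k exp(log_w_k), computed stably (ASSUMED form of Z-hat)."""
    m = max(log_weights)
    total = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(total) - math.log(len(log_weights))

def effective_sample_size(log_weights):
    """Standard ESS = (sum w)^2 / sum w^2; it ranges from 1 to K (ASSUMED definition)."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)

# Equal weights give the large-ESS limit (ESS = K).
equal = [0.0, 0.0, 0.0, 0.0]
# One dominating weight gives the small-ESS limit (ESS close to 1).
dominated = [0.0, -50.0, -50.0, -50.0]

ess_equal = effective_sample_size(equal)
ess_dominated = effective_sample_size(dominated)
log_z_equal = log_z_hat(equal)
```

The two toy inputs correspond to the two ESS regimes the snippet refers to: all samples contributing equally, and a single sample carrying almost all of the weight.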
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Lu, Shiyin, Li, Yang, Chen, Qing-Guo, Xu, Zhao, Luo, Weihua, Zhang, Kaifu, Ye, Han-Jia
Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- poses challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
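The "probabilistic combination of the indexed embeddings" described in this abstract can be illustrated with a small sketch. It assumes that each patch yields a vector of logits over the visual vocabulary, that a softmax turns them into probabilities, and that the final embedding is the probability-weighted sum of table rows. The function names and the toy table are hypothetical; the actual implementation lives in the linked repository and may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def visual_embedding(patch_logits, embedding_table):
    """Probability-weighted sum of embedding-table rows (ASSUMED combination rule)."""
    probs = softmax(patch_logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

# Toy visual vocabulary: 3 entries, embedding dimension 2.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits put equal probability on every table row.
embedding = visual_embedding([0.0, 0.0, 0.0], table)
```

A textual token, by contrast, selects a single row of its table, which is the special case where all probability mass sits on one index.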
CTVIS: Consistent Training for Online Video Instance Segmentation
Ying, Kaining, Zhong, Qing, Mao, Weian, Wang, Zhenhua, Chen, Hao, Wu, Lin Yuanbo, Liu, Yifan, Fan, Chengxiang, Zhuge, Yunzhi, Shen, Chunhua
The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon the contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To this end, we propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which is devoted to aligning the training and inference pipelines in terms of building CIs. Specifically, CTVIS constructs CIs by mirroring the momentum-averaged embedding and memory bank storage mechanisms used at inference, and by adding noise to the relevant embeddings. Such an extension allows a reliable comparison between embeddings of current instances and the stable representations of historical instances, thereby conferring an advantage in modeling VIS challenges such as occlusion, re-identification, and deformation. Empirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three VIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS (35.5% AP). Furthermore, we find that pseudo-videos transformed from images can train robust models surpassing fully-supervised ones.
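A memory bank with momentum-averaged embeddings, as mentioned in this abstract, might look roughly like the sketch below. It assumes a fixed momentum coefficient and a plain dictionary as the bank; the abstract does not specify either, and the names `update_memory` and `bank` are hypothetical.

```python
def update_memory(bank, instance_id, embedding, momentum=0.9):
    """Blend a new embedding into the stored one.

    The first sighting of an instance is stored as-is. The fixed
    momentum value is an ASSUMPTION made for illustration only.
    """
    if instance_id not in bank:
        bank[instance_id] = list(embedding)
    else:
        old = bank[instance_id]
        bank[instance_id] = [
            momentum * o + (1.0 - momentum) * e
            for o, e in zip(old, embedding)
        ]
    return bank[instance_id]

bank = {}
update_memory(bank, "person_1", [1.0, 0.0])
smoothed = update_memory(bank, "person_1", [0.0, 1.0], momentum=0.75)
```

The stored vector moves only part of the way toward each new observation, which is what makes it a stable representation of a historical instance to compare current embeddings against.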
MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training
Huang, De-An, Yu, Zhiding, Anandkumar, Anima
We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following inference pipeline: we first apply the trained query-based image instance segmentation to video frames independently. The segmented instances are then tracked by bipartite matching of the corresponding queries. This inference is done in an online fashion and does not need to process the whole video at once. MinVIS thus has the practical advantages of reducing both the labeling costs and the memory requirements, while not sacrificing the VIS performance. Code is available at: https://github.com/NVlabs/MinVIS
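The tracking step described in this abstract, bipartite matching of queries between frames, can be sketched as follows. The sketch assumes cosine similarity as the matching score and the same number of queries in both frames, and it uses a brute-force search over permutations instead of the Hungarian algorithm, so it is only practical for a handful of queries. All names are hypothetical and are not taken from the MinVIS code.

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match_queries(prev_queries, curr_queries):
    """Return assignment[i] = index in curr_queries matched to prev_queries[i].

    Brute-force optimal one-to-one assignment (ASSUMED score: cosine).
    """
    n = len(prev_queries)
    sim = [[cosine(p, c) for c in curr_queries] for p in prev_queries]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return list(best)

# Query embeddings for three instances in two consecutive frames;
# the current frame lists the same instances in a different order.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[0.1, 0.9], [0.9, 1.1], [1.0, 0.1]]

assignment = match_queries(prev_frame, curr_frame)
```

For realistic query counts, an assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation search. Because matching only needs the previous and current frame, it fits the online setting the abstract describes.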
OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning
Liu, Sheng, Lin, Kevin, Wang, Lijuan, Yuan, Junsong, Liu, Zicheng
We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances, i.e., image patches, that satisfy the search intent from an image database. The term "open vocabulary" means that there are neither restrictions on the visual instances to be searched nor restrictions on the words that can be used to compose the textual search query. We propose to address such a search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages massive image-caption pairs as weak image-level (not instance-level) supervision to learn a rich cross-modal semantic space where the representations of visual instances (not images) and those of textual queries are aligned, thus allowing us to measure the similarities between any visual instance and an arbitrary textual query. To evaluate the performance of ViSA, we build two datasets named OVIS40 and OVIS1600 and also introduce a pipeline for error analysis. Through extensive experiments on the two datasets, we demonstrate ViSA's ability to search for visual instances in images not available during training given a wide range of textual queries including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 21.9% on OVIS40 under the most challenging setting and achieves an mAP@6 of 14.9% on the OVIS1600 dataset.
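Once instances and queries share one embedding space, returning a ranked list reduces to sorting instances by their similarity to the query. The sketch below assumes cosine similarity and uses made-up two-dimensional vectors; the abstract does not state which similarity measure ViSA uses, and the names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_instances(query_vec, instance_vecs):
    """Return instance names sorted from most to least similar to the query."""
    scores = {name: cosine(query_vec, vec) for name, vec in instance_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings for one text query and three image patches.
query = [1.0, 0.2]
instances = {
    "patch_a": [0.9, 0.1],
    "patch_b": [0.0, 1.0],
    "patch_c": [0.5, 0.5],
}

ranking = rank_instances(query, instances)
```

A metric such as mAP@k is then computed over the top k entries of rankings like this one.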