AITopics | stic

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Neural Information Processing SystemsMar-22-2026, 19:43:57 GMT

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs.

large language model, machine learning, natural language, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.59)

Add feedback

ed45d6a03de84cc650cae0655f699356-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 14:47:07 GMT

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > China > Hong Kong (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Enhancing Large Vision Language Models with Self-Training on Image Comprehension Yihe Deng 1, Pan Lu1,3, Fan Yin

Neural Information Processing SystemsOct-10-2025, 20:38:46 GMT

Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire.

arxiv preprint arxiv, language model, stic, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > China > Hong Kong (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Neural Information Processing SystemsMay-27-2025, 20:41:56 GMT

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts.

large language model, machine learning, natural language, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.61)

Add feedback

Causal Discovery from Time-Series Data with Short-Term Invariance-Based Convolutional Neural Networks

Shen, Rujia, Wang, Boran, Zhao, Chao, Guan, Yi, Jiang, Jingchi

arXiv.org Artificial IntelligenceAug-15-2024

Causal discovery from time-series data aims to capture both intra-slice (contemporaneous) and inter-slice (time-lagged) causality between variables within the temporal chain, which is crucial for various scientific disciplines. Compared to causal discovery from non-time-series data, causal discovery from time-series data necessitates more serialized samples with a larger amount of observed time steps. To address the challenges, we propose a novel gradient-based causal discovery approach STIC, which focuses on \textbf{S}hort-\textbf{T}erm \textbf{I}nvariance using \textbf{C}onvolutional neural networks to uncover the causal relationships from time-series data. Specifically, STIC leverages both the short-term time and mechanism invariance of causality within each window observation, which possesses the property of independence, to enhance sample efficiency. Furthermore, we construct two causal convolution kernels, which correspond to the short-term time and mechanism invariance respectively, to estimate the window causal graph. To demonstrate the necessity of convolutional neural networks for causal discovery from time-series data, we theoretically derive the equivalence between convolution and the underlying generative principle of time-series data under the assumption that the additive noise model is identifiable. Experimental evaluations conducted on both synthetic and FMRI benchmark datasets demonstrate that our STIC outperforms baselines significantly and achieves the state-of-the-art performance, particularly when the datasets contain a limited number of observed time steps. Code is available at \url{https://github.com/HITshenrj/STIC}.

causal discovery, causal relationship, dataset, (14 more...)

arXiv.org Artificial Intelligence

2408.08023

Country:

Asia > China > Heilongjiang Province > Harbin (0.05)
North America > United States > North Carolina (0.04)
South America > Brazil > São Paulo (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Health Care Technology (0.48)
Health & Medicine > Therapeutic Area (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Deng, Yihe, Lu, Pan, Yin, Fan, Hu, Ziniu, Shen, Sheng, Zou, James, Chang, Kai-Wei, Wang, Wei

arXiv.org Artificial IntelligenceMay-30-2024

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.

arxiv preprint arxiv, fine-tuning, stic, (12 more...)

arXiv.org Artificial Intelligence

2405.19716

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.67)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Filters

Collaborating Authors

stic

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

ed45d6a03de84cc650cae0655f699356-Paper-Conference.pdf

Enhancing Large Vision Language Models with Self-Training on Image Comprehension Yihe Deng 1, Pan Lu1,3, Fan Yin

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Causal Discovery from Time-Series Data with Short-Term Invariance-Based Convolutional Neural Networks

Enhancing Large Vision Language Models with Self-Training on Image Comprehension