Efficient Learning for Product Attributes with Compact Multimodal Models

Jul-29-2025, arXiv.org Artificial Intelligence

Image-based product attribute prediction is a core e-commerce task with numerous applications. Supervised fine-tuning of Vision Language Models (VLMs) is difficult to scale because manual or API-based annotation is costly. In this paper, we investigate label-efficient semi-supervised fine-tuning strategies for compact VLMs (2B-3B parameters) that leverage unlabeled product listings through Direct Preference Optimization (DPO). Beginning with a small labeled set annotated via an API, we first employ parameter-efficient fine-tuning (PEFT) to train low-rank adapter modules. To update the adapter weights with unlabeled data, we generate multiple reasoning-and-answer chains per unlabeled sample and split these chains into preferred and dispreferred groups based on self-consistency. We then fine-tune the model with the DPO loss and use the updated model for the next iteration. By combining PEFT with DPO, our method achieves efficient convergence with minimal compute overhead. On a dataset spanning twelve e-commerce verticals, DPO-based fine-tuning, which uses only unlabeled data, yields a significant improvement over the supervised model. Moreover, experiments show that accuracy with DPO training improves as more unlabeled data is added, indicating that a large pool of unlabeled samples can be effectively leveraged to improve performance.
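
The self-consistency split and the DPO objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `min_agreement` threshold, the majority-vote rule, and the default `beta` are all assumptions made for illustration; the abstract only states that chains are separated into preferred and dispreferred by self-consistency and that the model is then trained with the DPO loss. The loss below is the standard per-pair DPO formula, computed from sequence log-probabilities that would in practice come from the adapter-tuned policy and a frozen reference model.

```python
import math
from collections import Counter
from itertools import product


def build_preference_pairs(chains, min_agreement=0.6):
    """Turn sampled chains for ONE unlabeled item into (preferred, dispreferred) pairs.

    chains: list of (reasoning_text, answer) tuples sampled from the current model.
    min_agreement: hypothetical threshold; items whose majority answer has a
    smaller vote share are skipped as too ambiguous (an assumption, not from the paper).
    """
    if not chains:
        return []
    counts = Counter(answer for _, answer in chains)
    majority_answer, votes = counts.most_common(1)[0]
    # No usable signal if every chain agrees (nothing to reject)
    # or if the majority is too weak to trust.
    if len(counts) == 1 or votes / len(chains) < min_agreement:
        return []
    preferred = [c for c in chains if c[1] == majority_answer]
    dispreferred = [c for c in chains if c[1] != majority_answer]
    return list(product(preferred, dispreferred))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin)).

    Inputs are summed token log-probabilities of each full chain under the
    trainable policy and the frozen reference model. beta=0.1 is an assumed default.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

For example, three sampled chains with answers `red`, `red`, and `blue` would produce two pairs, each preferring a `red` chain over the `blue` one. In an iterative setup like the one the abstract describes, the pairs would be regenerated with the updated model before each new round of DPO training.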

large language model, machine learning, natural language

Genre:
- Research Report > New Finding (0.46)

Industry:
- Information Technology (0.71)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.47)
  - Machine Learning
    - Unsupervised or Indirectly Supervised Learning (0.78)
    - Inductive Learning (0.47)
