Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
arXiv.org Artificial Intelligence
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
Nov-12-2025
- Country:
- Africa
- Asia
- Japan > Honshū
- Kansai > Kyoto Prefecture > Kyoto (0.04)
- Indonesia > Bali (0.04)
- Middle East
- Israel (0.04)
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- China
- South Korea > Seoul
- Seoul (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Vietnam > Hanoi
- Hanoi (0.04)
- Singapore (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Europe
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Italy
- France
- Bourgogne-Franche-Comté > Doubs
- Besançon (0.04)
- Île-de-France > Paris
- Paris (0.04)
- United Kingdom > England
- Surrey (0.04)
- Netherlands
- North Holland > Amsterdam (0.04)
- South Holland > Rotterdam (0.04)
- Germany > Bavaria
- Upper Bavaria > Munich (0.04)
- Austria > Vienna (0.14)
- Switzerland > Zürich
- Zürich (0.14)
- Spain > Galicia
- Madrid (0.04)
- North America
- Canada > British Columbia
- Vancouver (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- California
- Los Angeles County > Long Beach (0.04)
- San Diego County > San Diego (0.04)
- Santa Clara County
- Mountain View (0.04)
- San Jose (0.04)
- Washington > King County
- Seattle (0.04)
- Florida > Miami-Dade County
- Miami (0.14)
- New Mexico > Bernalillo County
- Albuquerque (0.04)
- Tennessee > Davidson County
- Nashville (0.04)
- Utah > Salt Lake County
- Salt Lake City (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- New York (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Oceania > Australia
- New South Wales > Sydney (0.04)
- Victoria > Melbourne (0.04)
- South America > Colombia
- Meta Department > Villavicencio (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Information Technology (0.92)
- Transportation (0.67)