Supplementary Materials
AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark
Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S. Adve, Yu-Xiong Wang
Neural Information Processing Systems
Our evaluation and analysis are conducted mainly on the group of models listed in Table 2 in the main paper. We chose these models so that they cover most of the popular and best-performing methods used in recent multimodal understanding work. In this section, we describe each model used in our experiments, including its evaluation details, the public checkpoint we chose, and the prompts we used to adapt it to our dataset.

During evaluation, we follow the standard prompt provided by the authors whenever possible for multiple-choice and short-answer questions. When no prompt is provided for a model, we use a custom prompt, selected through several iterations of prompt engineering as the one that produces the most effective results. The images are always included as the prefix.

We used three proprietary models in our evaluation: GPT-o4-mini [1], Gemini 1.5 Pro [9], and Claude 3 Haiku [10]. Below we note the model API version used for evaluation.
GPT-o4-mini: May 13-15, 2025.

Cambrian-1 is a recent state-of-the-art model that excels at visual-centric tasks. This model explores combinations of vision encoders, text and image integration techniques, and instruction tuning strategies. We use the official implementation and checkpoint with a LLaMA3-8B-Instruct LLM backbone in our evaluation.

InternVL scales up the vision foundation model while aligning it with the backbone LLM, and is trained on web-scale image-text data to achieve strong performance across a variety of vision-centric tasks. We use the official implementation and checkpoint with the InternViT-300M-448px vision backbone and the Internlm2.5-7B-chat language backbone.

LLaMA-3.2 is the first collection of multimodal large language models from the LLaMA family, which was previously text-only. Vision is integrated through cross-attention layers and a pre-trained vision encoder that feeds directly into the text processor. The model follows a commonly used training recipe: pretraining on noisy image-text pairs, followed by training on high-quality, knowledge-enhanced pairs. Notably, the language-model parameters were frozen while training the image-text alignment in order to retain strong text-only capabilities. We use the official implementation and checkpoint, which uses a LLaMA-3.1 text-only language backbone, in our evaluation. When evaluating the model, we use a custom prompt since no standard prompt is provided.
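The prompting policy described above (use the authors' prompt when one exists, otherwise fall back to a custom prompt, and always place the images first) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `<image>` placeholder, the `Question` class, the `build_prompt` function, and the fallback templates are hypothetical and are not the prompts or code used in our evaluation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical placeholder; each real model defines its own image token.
IMAGE_TOKEN = "<image>"

# Hypothetical fallback templates, NOT the prompts used in the paper.
CUSTOM_TEMPLATES = {
    "multiple_choice": "{question}\n{options}\nAnswer with the letter of the correct option.",
    "short_answer": "{question}\nAnswer with a short phrase.",
}


@dataclass
class Question:
    text: str
    kind: str  # "multiple_choice" or "short_answer"
    options: Sequence[str] = ()
    num_images: int = 1


def format_options(options: Sequence[str]) -> str:
    """Label options as 'A. ...', 'B. ...', one per line."""
    return "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))


def build_prompt(q: Question, author_template: Optional[str] = None) -> str:
    """Prefer the authors' template; otherwise fall back to a custom one.

    Image placeholders are always emitted before the text (image-as-prefix).
    """
    if author_template is not None:
        template = author_template
    else:
        template = CUSTOM_TEMPLATES[q.kind]
    body = template.format(question=q.text, options=format_options(q.options))
    prefix = "\n".join([IMAGE_TOKEN] * q.num_images)
    return f"{prefix}\n{body}"
```

For a model without an author-provided prompt, such as LLaMA-3.2 in our setup, `build_prompt(q)` would take the custom branch; for a model that ships a standard prompt, that template would be passed as `author_template`.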
Jun-18-2026, 05:58:16 GMT