metadata
60ea0211b38a3ccd7a241f523dc7cf63-Supplemental-Datasets_and_Benchmarks_Track.pdf
Below we describe a few other prevalent multi-label datasets and explain how ML48S differs from them, which is why they were excluded from comparison in this paper. PASCAL VOC [11] was created for object detection and classification, covering 20 basic-level classes across 4,574 images, with most images containing a single prominent object. This dataset is much smaller than ML48S and contains far fewer classes, all of which are coarse-grained. VG500 is a modification of the Visual Genome dataset [19], a dataset focused on dense annotations linking images to their respective captions. This dataset is not intended to be bounded by categories but has open-vocabulary annotations.
528d56195a2c77c808494c86fa7c77ad-Supplemental-Datasets_and_Benchmarks_Track.pdf
A.1 Dataset Examples
In this section of the appendix, we present a detailed overview of several representative tasks from each category included in REASONINGGYM. For each task, we describe its structure and complexity parameters, and provide examples.
A.1.1 complex_arithmetic (Algebra)
Find the solution of an arithmetic operation involving complex numbers.
The spiral order is clockwise, starting from the top-left corner. Predict the corresponding output grid by applying the rule you found.
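The two task descriptions above can be illustrated with a minimal Python sketch. The exact prompt and answer formats used by the benchmark are not given in this excerpt, so the operands, the function name `spiral_order`, and the example grid are illustrative assumptions only.

```python
from typing import List

# Complex arithmetic: Python's built-in complex type (hypothetical operands).
product = (3 + 2j) * (1 - 4j)  # 3 - 12j + 2j - 8j^2 = 11 - 10j


def spiral_order(grid: List[List[int]]) -> List[int]:
    """Return the elements of `grid` in clockwise spiral order from the top-left."""
    result: List[int] = []
    if not grid or not grid[0]:
        return result
    top, bottom = 0, len(grid) - 1
    left, right = 0, len(grid[0]) - 1
    while top <= bottom and left <= right:
        # Left to right along the current top row.
        for col in range(left, right + 1):
            result.append(grid[top][col])
        top += 1
        # Top to bottom along the current right column.
        for row in range(top, bottom + 1):
            result.append(grid[row][right])
        right -= 1
        if top <= bottom:
            # Right to left along the current bottom row.
            for col in range(right, left - 1, -1):
                result.append(grid[bottom][col])
            bottom -= 1
        if left <= right:
            # Bottom to top along the current left column.
            for row in range(bottom, top - 1, -1):
                result.append(grid[row][left])
            left += 1
    return result


if __name__ == "__main__":
    print(product)
    print(spiral_order([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

The boundary checks before the bottom-row and left-column passes keep single-row and single-column grids from being visited twice.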
PurpCode: Reasoning for Safer Code Generation
Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang
We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Moreover, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs (Supplementary Material)
In this section, we introduce the construction pipeline for generating MVU-Eval QA pairs based on each data source. These questions include: (1) Object Recognition, (2) Spatial Understanding, (3) Counting, (4) Knowledge-intensive Reasoning, and (5) Temporal Reasoning. These generated questions, answers, and candidate choices are manually checked by humans. Pipelines for constructing video pairs are slightly different across datasets. By default, 2-6 videos are randomly sampled, regardless of their labels.
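The default sampling step could look roughly like the sketch below. The excerpt does not say how the group size is drawn or whether sampling is with replacement, so this version assumes a uniformly chosen size between 2 and 6 and sampling without replacement. The function name `sample_video_group` and the video identifiers are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_video_group(
    video_ids: Sequence[str],
    min_videos: int = 2,
    max_videos: int = 6,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick a random group of distinct videos, ignoring any labels."""
    rng = rng or random.Random()
    if len(video_ids) < min_videos:
        raise ValueError("not enough videos to form a group")
    # Assumption: the group size is uniform over [min_videos, max_videos],
    # capped by the number of videos actually available.
    upper = min(max_videos, len(video_ids))
    group_size = rng.randint(min_videos, upper)
    # Assumption: videos are sampled without replacement.
    return rng.sample(list(video_ids), group_size)


if __name__ == "__main__":
    pool = [f"video_{i:03d}" for i in range(20)]
    print(sample_video_group(pool, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the sampled groups reproducible across runs.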
End-to-End Low-Light Enhancement for Object Detection with Learned Metadata from RAWs
Although RAW images offer advantages over sRGB by avoiding ISP-induced distortion and preserving more information in low-light conditions, their widespread use is limited due to high storage costs, transmission burdens, and the need for significant architectural changes for downstream tasks. To address these issues, this paper explores a new raw-based machine vision paradigm, termed Compact RAW Metadata-guided Image Refinement (CRM-IR). In particular, we propose a Machine Vision-oriented Image Refinement (MV-IR) module that refines sRGB images to better suit machine vision preferences, guided by learned raw metadata. In detail, we propose a Cross-Modal Contextual Entropy (CMCE) network for raw metadata extraction and compression. It builds upon the latent representation and entropy modeling framework of learned image compression methods, and uniquely exploits the contextual correspondence between raw images and their sRGB counterparts to achieve more efficient and compact metadata representation. Additionally, we integrate priors derived from the ISP pipeline to simplify the refinement process, enabling a more efficient design. Such a design allows the CRM-IR to focus on extracting the most essential metadata from raw images to support downstream machine vision tasks, while remaining plug-and-play and fully compatible with existing imaging pipelines, without any changes to model architectures or ISP modules. We implement our CRM-IR scheme on various object detection networks, and extensive experiments under low-light conditions demonstrate that it can significantly improve performance with an additional bitrate cost of less than $10^{-3}$ bits per pixel.
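To give a sense of scale for the bits-per-pixel bound, the short calculation below converts it into an absolute budget per image. The 1920x1080 resolution is an assumption chosen for illustration and does not come from the paper. The result is an upper bound, since the abstract reports a cost below this rate.

```python
def metadata_budget_bits(width: int, height: int, bits_per_pixel: float = 1e-3) -> float:
    """Upper bound on the extra metadata size, in bits, for one image."""
    return width * height * bits_per_pixel


if __name__ == "__main__":
    # Hypothetical Full HD frame; the paper's image sizes are not stated here.
    bits = metadata_budget_bits(1920, 1080)
    print(f"{bits:.1f} bits, about {bits / 8:.1f} bytes")
```

For this assumed resolution the bound works out to roughly 2,074 bits, or about 259 bytes per image.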
Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction
Estimating agent pose and 3D scene structure from multi-camera rigs is a central task in embodied AI applications such as autonomous driving. Recent learned approaches such as DUSt3R have shown impressive performance in multiview settings. However, these models treat images as unstructured collections, limiting effectiveness in scenarios where frames are captured from synchronized rigs with known or inferable structure. To this end, we introduce Rig3R, a generalization of prior multiview reconstruction models that incorporates rig structure when available, and learns to infer it when not. Rig3R conditions on optional rig metadata including camera IDs, timestamp, and rig calibrations to develop a rig-aware latent space that remains robust to missing information.
ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions
Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning & distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally.
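The idea of metadata as context that is excluded from the loss can be sketched as follows. This is a simplified illustration using string tokens and given per-token probabilities, not the authors' training code. The names `build_example` and `masked_mean_nll`, and the example tokens, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def build_example(
    metadata_tokens: Sequence[str], text_tokens: Sequence[str]
) -> Tuple[List[str], List[int]]:
    """Prepend metadata to the text and mark which positions count toward the loss."""
    tokens = list(metadata_tokens) + list(text_tokens)
    # 0 = context only (metadata), 1 = contributes to the training loss.
    loss_mask = [0] * len(metadata_tokens) + [1] * len(text_tokens)
    return tokens, loss_mask


def masked_mean_nll(token_probs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over the unmasked positions only."""
    if len(token_probs) != len(loss_mask):
        raise ValueError("token_probs and loss_mask must have the same length")
    total, count = 0.0, 0
    for prob, keep in zip(token_probs, loss_mask):
        if keep:
            total += -math.log(prob)
            count += 1
    if count == 0:
        raise ValueError("loss_mask excludes every position")
    return total / count


if __name__ == "__main__":
    tokens, mask = build_example(["<url>", "example.org"], ["hello", "world"])
    # Hypothetical model probabilities for each target token.
    print(tokens, mask, masked_mean_nll([0.1, 0.1, 0.5, 0.5], mask))
```

The model still sees the metadata tokens as input, but their predicted probabilities have no effect on the loss value.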