conf
Searching Efficient Semantic Segmentation Architectures via Dynamic Path Selection
Existing NAS methods for semantic segmentation typically apply uniform optimization to all candidate networks (paths) within a one-shot supernet. However, the concurrent existence of both promising and suboptimal paths often results in inefficient weight updates and gradient conflicts. This issue is particularly severe in semantic segmentation due to its complex multi-branch architectures and large search space, which further degrade the supernet's ability to accurately evaluate individual paths and identify high-quality candidates. To address this issue, we propose Dynamic Path Selection (DPS), a selective training strategy that leverages multiple performance proxies to guide path optimization. DPS follows a stagewise paradigm, where each phase emphasizes a different objective: early stages prioritize convergence, the middle stage focuses on expressiveness, and the final stage emphasizes a balanced combination of expressiveness and generalization. At each stage, paths are selected based on these criteria, concentrating optimization efforts on promising paths, thus facilitating targeted and efficient model updates. Additionally, DPS integrates a dynamic stage scheduler and a diversity-driven exploration strategy, which jointly enable adaptive stage transitions and maintain structural diversity among selected paths. Extensive experiments demonstrate that, under the same search space, DPS can discover efficient models with strong generalization and superior performance.
bd20ff18345f0ded89242bf9ef58e46c-Paper-Position_Paper_Track.pdf
This position paper argues that human pose estimation (HPE) cannot be considered privacy-preserving or human-centric unless privacy is measured and evaluated. Although privacy concerns have become more visible in recent years, HPE systems are still assessed almost exclusively using accuracy metrics. Privacy is neither defined in measurable terms nor linked to regulatory requirements, and common deployment architectures introduce additional risks due to data transmission and storage. We highlight the limitations of current practices, including the continued reliance on RGB inputs and the lack of benchmarks that reflect legal and ethical constraints. We call for a shift in evaluation practices: privacy must become part of how HPE systems are designed, tested, and compared.
Diversity-oriented Deep Multi-modal Clustering
Deep multi-modal clustering (DMC) aims to explore the correlated information from different modalities to improve the clustering performance. Most existing DMCs attempt to investigate the consistency or/and complementarity information by fusing all modalities, but this will lead to the following challenges: 1) Information conflicts between modalities emerge.
Beyond Masked and Unmasked Discrete Diffusion Models via Partial Masking
Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation.
RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models
Low-Rank Adaptation (LoRA) lowers the computational and memory overhead of fine-tuning large models by updating a low-dimensional subspace of the pretrained weight matrix. Albeit efficient, LoRA exhibits suboptimal convergence and noticeable performance degradation, due to inconsistent and imbalanced weight updates induced by its nonunique low-rank factorizations. To overcome these limitations, this article identifies the optimal low-rank factorization per step that minimizes an upper bound on the loss. The resultant refactored low-rank adaptation (RefLoRA) method promotes a flatter loss landscape, along with consistent and balanced weight updates, thus speeding up stable convergence. Extensive experiments evaluate RefLoRA on natural language understanding, and commonsense reasoning tasks with popular large language models including DeBERTaV3, LLaMA-7B, LLaMA2-7B and LLaMA3-8B. The numerical tests corroborate that RefLoRA converges faster, outperforms various benchmarks, and enjoys negligible computational overhead compared to state-of-the-art LoRA variants.
PandaPose: 3DHuman Pose Lifting from a Single Image via Propagating 2DPose Prior to 3DAnchor Space
Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies.
Differentiable Hierarchical Visual Tokenization
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.