synonym
On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection (Appendix) A Additional experimental results
This section presents more comprehensive experimental results.
A.1 Comparison with post-hoc methods
We also compare our textual outlier method with post-hoc methods, another prominent family of approaches to OoD detection. The comparison covers six widely used or recently proposed methods known for strong detection performance: MSP [4], ODIN [8], Mahalanobis [7], Energy [10], ReAct [14], and KNN [15]. All baselines follow the settings of their original papers. As Table 6 shows, our textual outlier approach performs best among these methods, further supporting its effectiveness.
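To make the term "post-hoc" concrete, the following is a minimal sketch of two of the listed baselines, MSP and Energy, written from their standard definitions rather than taken from this appendix. The function names and the example logits are illustrative assumptions; both scores are computed from a trained classifier's logits without any retraining.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: higher suggests in-distribution."""
    return max(softmax(logits))

def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T): higher suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

# Hypothetical logits: one confident prediction and one near-uniform prediction.
confident = [8.0, 0.5, 0.2]
flat = [1.0, 0.9, 1.1]

for name, logits in [("confident", confident), ("flat", flat)]:
    print(name, round(msp_score(logits), 3), round(energy_score(logits), 3))
```

An input is flagged as OoD when its score falls below a threshold chosen on held-out in-distribution data; the threshold itself is not specified here.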
Type-to-Track: Retrieve Any Object via Prompt-based Tracking (Supplementary Appendix) 1 Dataset Taxonomy nm, syn, def, cap, retr
We introduce two new evaluation scenarios, cap and retr, which operate at the object level rather than the category level. Defining objects only by category names, category synonyms, and category definitions is insufficient to describe them accurately and leads to ambiguous results. By focusing on the object level, the benchmarking sets provide more accurate and meaningful evaluations of multiple object retrieval methods. We include a comprehensive taxonomy of the prompt types used to construct our settings. However, the retr setting could not be constructed on MOT17 because test annotations for this dataset are unavailable. To construct this setting, bounding boxes are filtered to match the corresponding retrieval prompt whenever the prompt changes. Section 2 describes how to construct this retrieval prompt.
OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration
Ren, Mucheng, Chen, He, Yan, Yuchen, Hu, Danqing, Xu, Jun, Zeng, Xian
Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
Moll, Johannes, Graf, Markus, Lemke, Tristan, Lenhart, Nicolas, Truhn, Daniel, Delbrouck, Jean-Benoit, Pan, Jiazhen, Rueckert, Daniel, Adams, Lisa C., Bressem, Keno K.
Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $τ_b=0.670$), moderate alignment for fidelity ($τ_b=0.387$), and weak alignment for confidence tone ($τ_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality can be decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
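For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Kendall's $τ_b$ computed from its standard definition, which corrects for ties in either ranking. The function name and the example scores are hypothetical and are not data from the study.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither factor
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x_only)
                      * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical ordinal scores from an automated evaluator and one radiologist.
evaluator = [1, 2, 2, 3]
radiologist = [1, 2, 3, 3]
print(kendall_tau_b(evaluator, radiologist))
```

Values near 1 indicate that the two raters order the cases the same way, values near 0 indicate little ordinal agreement, and negative values indicate reversed orderings.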
QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps
Pekkanen, Matti, Verdoja, Francesco, Kyrki, Ville
Embeddings from Visual-Language Models are increasingly utilized to represent semantics in robotic maps, offering an open-vocabulary scene understanding that surpasses traditional, limited labels. Embeddings enable on-demand querying by comparing embedded user text prompts to map embeddings via a similarity metric. The key challenge in performing the task indicated in a query is that the robot must determine the parts of the environment relevant to the query. This paper proposes a solution to this challenge. We leverage natural-language synonyms and antonyms associated with the query within the embedding space, applying heuristics to estimate the language space relevant to the query, and use that to train a classifier to partition the environment into matches and non-matches. We evaluate our method through extensive experiments, querying both maps and standard image benchmarks. The results demonstrate increased queryability of maps and images. Our querying technique is agnostic to the representation and encoder used, and requires limited training.
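The idea of using synonyms and antonyms to delimit the query's region of the embedding space can be illustrated with a toy sketch. This is not the authors' implementation: the 3-D vectors stand in for real encoder embeddings, the plain logistic regression stands in for whatever classifier and heuristics the paper uses, and all names below are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def train_logistic(positives, negatives, epochs=500, lr=0.5):
    """Fit a linear classifier with per-sample gradient descent on the logistic loss."""
    dim = len(positives[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))
            g = p - y  # gradient of the loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def is_match(w, b, embedding):
    return dot(w, embedding) + b > 0.0

# Toy stand-ins for text embeddings (assumption: a real encoder would supply these).
query_and_synonyms = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.95, 0.0, 0.2])]
antonyms = [normalize(v) for v in ([-0.1, 1.0, 0.0], [0.0, 0.9, 0.3])]

w, b = train_logistic(query_and_synonyms, antonyms)

# Toy stand-ins for embeddings stored in two map cells.
map_cells = {"cell_a": normalize([0.8, 0.1, 0.1]), "cell_b": normalize([0.1, 0.9, 0.1])}
matches = {name: is_match(w, b, emb) for name, emb in map_cells.items()}
print(matches)
```

Replacing a fixed similarity threshold with a learned decision boundary is the design point being illustrated: the negatives tell the classifier where the query's meaning ends, which a single query embedding cannot do on its own.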