Goto

Collaborating Authors

 Liberec


Supplementary Information

Neural Information Processing Systems

The claim and evidence conflict pairs can be found at https://huggingface. The scope of our dataset is purely for scientific research. Conflict V erification: Ensuring that the default and conflict evidence are contradictory. The human evaluation results showed a high level of accuracy in our data generation process. We select models with 2B and 7B parameters for our analysis. MA2 [ Touvron et al., 2023 ] is a popular open-source foundation model, trained on 2T Models with 7B and 70B parameters are selected for our analysis. To facilitate parallel training, we employ DeepSpeed Zero-Stage 3 [ Ren et al., The prompt for generating semantic conflict descriptions is shown in Figure 1 . The prompt for generating default evidence is shown in Table 6 . The prompt for generating misinformation conflict evidence is shown in Table 7 . The prompt for generating temporal conflict evidence is shown in Table 8 . The prompt for generating semantic conflict evidence is shown in Table 9 .



Supplementary Information

Neural Information Processing Systems

The claim and evidence conflict pairs can be found at https://huggingface. The scope of our dataset is purely for scientific research. Conflict V erification: Ensuring that the default and conflict evidence are contradictory. The human evaluation results showed a high level of accuracy in our data generation process. We select models with 2B and 7B parameters for our analysis. MA2 [ Touvron et al., 2023 ] is a popular open-source foundation model, trained on 2T Models with 7B and 70B parameters are selected for our analysis. To facilitate parallel training, we employ DeepSpeed Zero-Stage 3 [ Ren et al., The prompt for generating semantic conflict descriptions is shown in Figure 1 . The prompt for generating default evidence is shown in Table 6 . The prompt for generating misinformation conflict evidence is shown in Table 7 . The prompt for generating temporal conflict evidence is shown in Table 8 . The prompt for generating semantic conflict evidence is shown in Table 9 .


A Benchmark for Evaluating Knowledge Conflicts in Large Language Models

Neural Information Processing Systems

Large language models (LLMs) have achieved impressive advancements across numerous disciplines, yet the critical issue of knowledge conflicts, a major source of hallucinations, has rarely been studied. While a few research explored the conflicts between the inherent knowledge of LLMs and the retrieved contextual knowledge, a comprehensive assessment of knowledge conflict in LLMs is still missing.


BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction

arXiv.org Artificial Intelligence

Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including LlamA 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction.


Linguistic Knowledge Transfer Learning for Speech Enhancement

arXiv.org Artificial Intelligence

Linguistic knowledge plays a crucial role in spoken language comprehension. It provides essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods predominantly rely on acoustic features to learn the mapping relationship between noisy and clean speech, with limited exploration of linguistic integration. While text-informed SE approaches have been investigated, they often require explicit speech-text alignment or externally provided textual data, constraining their practicality in real-world scenarios. Additionally, using text as input poses challenges in aligning linguistic and acoustic representations due to their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. Furthermore, we introduce a misalignment strategy to improve knowledge transfer. This strategy applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Additionally, results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. Moreover, CMKT remains effective even in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.


Conditional Deep Canonical Time Warping

arXiv.org Artificial Intelligence

Temporal alignment of sequences is a fundamental challenge in many applications, such as computer vision and bioinformatics, where local time shifting needs to be accounted for. Misalignment can lead to poor model generalization, especially in high-dimensional sequences. Existing methods often struggle with optimization when dealing with high-dimensional sparse data, falling into poor alignments. Feature selection is frequently used to enhance model performance for sparse data. However, a fixed set of selected features would not generally work for dynamically changing sequences and would need to be modified based on the state of the sequence. Therefore, modifying the selected feature based on contextual input would result in better alignment. Our suggested method, Conditional Deep Canonical Temporal Time Warping (CDCTW), is designed for temporal alignment in sparse temporal data to address these challenges. CDCTW enhances alignment accuracy for high dimensional time-dependent views be performing dynamic time warping on data embedded in maximally correlated subspace which handles sparsity with novel feature selection method. We validate the effectiveness of CDCTW through extensive experiments on various datasets, demonstrating superior performance over previous techniques.


Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering

arXiv.org Artificial Intelligence

Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous SoTA systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi-agent collaboration in TQA.


DENOASR: Debiasing ASRs through Selective Denoising

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) systems have been examined and shown to exhibit biases toward particular groups of individuals, influenced by factors such as demographic traits, accents, and speech styles. Noise can disproportionately impact speakers with certain accents, dialects, or speaking styles, leading to biased error rates. In this work, we introduce a novel framework DENOASR, which is a selective denoising technique to reduce the disparity in the word error rates between the two gender groups, male and female. We find that a combination of two popular speech denoising techniques, viz. DEMUCS and LE, can be effectively used to mitigate ASR disparity without compromising their overall performance. Experiments using two state-of-the-art open-source ASRs - OpenAI WHISPER and NVIDIA NEMO - on multiple benchmark datasets, including TIE, VOX-POPULI, TEDLIUM, and FLEURS, show that there is a promising reduction in the average word error rate gap across the two gender groups. For a given dataset, the denoising is selectively applied on speech samples having speech intelligibility below a certain threshold, estimated using a small validation sample, thus ameliorating the need for large-scale human-written ground-truth transcripts. Our findings suggest that selective denoising can be an elegant approach to mitigate biases in present-day ASR systems.


The first Cadenza challenges: using machine learning competitions to improve music for listeners with a hearing loss

arXiv.org Artificial Intelligence

It is well established that listening to music is an issue for those with hearing loss, and hearing aids are not a universal solution. How can machine learning be used to address this? This paper details the first application of the open challenge methodology to use machine learning to improve audio quality of music for those with hearing loss. The first challenge was a stand-alone competition (CAD1) and had 9 entrants. The second was an 2024 ICASSP grand challenge (ICASSP24) and attracted 17 entrants. The challenge tasks concerned demixing and remixing pop/rock music to allow a personalised rebalancing of the instruments in the mix, along with amplification to correct for raised hearing thresholds. The software baselines provided for entrants to build upon used two state-of-the-art demix algorithms: Hybrid Demucs and Open-Unmix. Evaluation of systems was done using the objective metric HAAQI, the Hearing-Aid Audio Quality Index. No entrants improved on the best baseline in CAD1 because there was insufficient room for improvement. Consequently, for ICASSP24 the scenario was made more difficult by using loudspeaker reproduction and specified gains to be applied before remixing. This also made the scenario more useful for listening through hearing aids. 9 entrants scored better than the the best ICASSP24 baseline. Most entrants used a refined version of Hybrid Demucs and NAL-R amplification. The highest scoring system combined the outputs of several demixing algorithms in an ensemble approach. These challenges are now open benchmarks for future research with the software and data being freely available.