AITopics | language token

Collaborating Authors

language token

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies

Mena, Carlos, Serra, Pol, Romero, Jacobo, Messaoudi, Abir, Giraldo, Jose, Armentano-Oller, Carme, Zevallos, Rodolfo, Meza, Ivan, Hernando, Javier

arXiv.org Artificial IntelligenceJul-21-2025

The lack of dedicated CS datasets limits ASR performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI's Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.13875

Country:

Europe (0.47)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.34)

Industry:

Media (0.47)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Chopra, Anuradha, Roy, Abhinaba, Herremans, Dorien

arXiv.org Artificial IntelligenceJun-19-2025

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details as well as high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens, while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens, to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed descriptions for longer music pieces, by chaining the outputs using a large-language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, captions and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.

caption, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2506.15154

Country:

Europe > Italy > Lombardy > Milan (0.04)
Europe > Italy > Lazio > Rome (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

Li, Shiyao, Hu, Yingchun, Ning, Xuefei, Liu, Xihui, Hong, Ke, Jia, Xiaotao, Li, Xiuhong, Yan, Yaqi, Ran, Pei, Dai, Guohao, Yan, Shengen, Yang, Huazhong, Wang, Yu

arXiv.org Artificial IntelligenceDec-27-2024

Vision-Language Models (VLMs) have enabled a variety of real-world applications. The large parameter size of VLMs brings large memory and computation overhead which poses significant challenges for deployment. Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on large language models (LLMs), without considering the differences across other modalities. In this paper, we discover that there is a significant difference in sensitivity between language and vision tokens in large VLMs. Therefore, treating tokens from different modalities equally, as in existing PTQ methods, may over-emphasize the insensitive modalities, leading to significant accuracy loss. To deal with the above issue, we propose a simple yet effective method, Modality-Balanced Quantization (MBQ), for large VLMs. Specifically, MBQ incorporates the different sensitivities across modalities during the calibration process to minimize the reconstruction loss for better quantization parameters. Extensive experiments show that MBQ can significantly improve task accuracy by up to 4.4% and 11.6% under W3 and W4A8 quantization for 7B to 70B VLMs, compared to SOTA baselines. Additionally, we implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-onevision-7B on the RTX 4090. The code is available at https://github.com/thu-nics/MBQ.

large language model, natural language, quantization, (16 more...)

arXiv.org Artificial Intelligence

2412.19509

Country: Asia > China (0.46)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Tu, Dezhan, Vashchilenko, Danylo, Lu, Yuzhe, Xu, Panpan

arXiv.org Artificial IntelligenceOct-29-2024

Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in prefill and decoding phases. Based on these observations, we introduce a layer-adaptive sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate the token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of KV cache achieves accuracy comparable to that with full cache. In a speed benchmark, our method accelerates end-to-end latency of generating 100 tokens by up to 2.33x and speeds up decoding by up to 7.08x, while reducing the memory footprint of KV cache in GPU by 90%.

kv cache, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2410.23317

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

Yang, Chih-Kai, Huang, Kuan-Po, Lee, Hung-yi

arXiv.org Artificial IntelligenceJul-9-2024

This research explores how the information of prompts interacts with the high-performing speech recognition model, Whisper. We compare its performances when prompted by prompts with correct information and those corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, We raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.

information, language token, textual prompt, (17 more...)

arXiv.org Artificial Intelligence

2406.05806

Country:

North America > United States > Washington > King County > Seattle (0.04)
Europe > Italy > Lombardy > Milan (0.04)
Asia > Taiwan (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education > Educational Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Wu, Kai, Jiang, Boyuan, Jiang, Zhengkai, He, Qingdong, Luo, Donghao, Wang, Shengzhi, Liu, Qingwen, Wang, Chengjie

arXiv.org Artificial IntelligenceMay-31-2024

Multimodal large language models (MLLMs) contribute a powerful mechanism to understanding visual information building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions for images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, leading to excessive dependence on linguistic tokens while neglecting vision information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations for MLLMs through the integration of noise feature perturbations. Noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights among visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Further, NoiseBoost pioneerly enables semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% with human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at https://kaiwu5.github.io/noiseboost.

hallucination, noiseboost, perturbation, (15 more...)

arXiv.org Artificial Intelligence

2405.20081

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

Peng, Puyuan, Yan, Brian, Watanabe, Shinji, Harwath, David

arXiv.org Artificial IntelligenceAug-15-2023

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts. Experiments show that compared to the default prompts, our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets. In addition, our experiments reveal many interesting properties of Whisper, including its robustness to prompts, bias on accents, and the multilingual understanding in its latent space. Code is available at https://github.com/jasonppy/PromptingWhisper

large language model, natural language, translation, (19 more...)

arXiv.org Artificial Intelligence

2305.11095

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Asia > East Asia (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation

Guo, Hongcheng, Liu, Jiaheng, Huang, Haoyang, Yang, Jian, Li, Zhoujun, Zhang, Dongdong, Cui, Zheng, Wei, Furu

arXiv.org Artificial IntelligenceNov-28-2022

Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features, which has attracted considerable attention from both natural language processing and computer vision communities. Recent advances still struggle to train a separate model for each language pair, which is costly and unaffordable when the number of languages increases in the real world. In other words, the multilingual multimodal machine translation (Multilingual MMT) task has not been investigated, which aims to handle the aforementioned issues by providing a shared semantic space for multiple languages. Besides, the image modality has no language boundaries, which is superior to bridging the semantic gap between languages. To this end, we first propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages. Then, an effective baseline LVP-M3 using visual prompts is proposed to support translations between different languages, which includes three stages (token encoding, language-aware visual prompt generation, and language translation). Extensive experimental results on our constructed benchmark datasets demonstrate the effectiveness of LVP-M3 method for Multilingual MMT.

artificial intelligence, lvp-m 3, natural language, (15 more...)

arXiv.org Artificial Intelligence

2210.15461

Country: Asia > China (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Setting the rhythm scene: deep learning-based drum loop generation from arbitrary language cues

Tripodi, Ignacio J.

arXiv.org Artificial IntelligenceSep-20-2022

Generative artificial intelligence models can be a valuable aid to music composition and live performance, both to aid the professional musician and to help democratize the music creation process for hobbyists. Here we present a novel method that, given an English word or phrase, generates 2 compasses of a 4-piece drum pattern that embodies the "mood" of the given language cue, or that could be used for an audiovisual scene described by the language cue. We envision this tool as composition aid for electronic music and audiovisual soundtrack production, or an improvisation tool for live performance. In order to produce the training samples for this model, besides manual annotation of the "scene" or "mood" terms, we have designed a novel method to extract the consensus drum track of any song. This consists of a 2-bar, 4-piece drum pattern that represents the main percussive motif of a song, which could be imported into any music loop device or live looping software. These two key components (drum pattern generation from a generalizable input, and consensus percussion extraction) present a novel approach to computer-aided composition and provide a stepping stone for more comprehensive rhythm generation.

artificial intelligence, drum pattern, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2209.10016

Country: North America > United States > Colorado (0.04)

Genre: Research Report > Promising Solution (0.75)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MATra: A Multilingual Attentive Transliteration System for Indian Scripts

Raj, Yash, Laddagiri, Bhavesh

arXiv.org Artificial IntelligenceAug-23-2022

Transliteration is a task in the domain of NLP where the output word is a similar-sounding word written using the letters of any foreign language. Today this system has been developed for several language pairs that involve English as either the source or target word and deployed in several places like Google Translate and chatbots. However, there is very little research done in the field of Indic languages transliterated to other Indic languages. This paper demonstrates a multilingual model based on transformers (with some modifications) that can give noticeably higher performance and accuracy than all existing models in this domain and get much better results than state-of-the-art models. This paper shows a model that can perform transliteration between any pair among the following five languages - English, Hindi, Bengali, Kannada and Tamil. It is applicable in scenarios where language is a barrier to communication in any written task. The model beats the state-of-the-art (for all pairs among the five mentioned languages - English, Hindi, Bengali, Kannada, and Tamil) and achieves a top-1 accuracy score of 80.7%, about 29.5% higher than the best current results. Furthermore, the model achieves 93.5% in terms of Phonetic Accuracy (transliteration is primarily a phonetic/sound-based task).

dataset, indic, transliteration, (15 more...)

arXiv.org Artificial Intelligence

2208.10801

Country: Asia > India > Chandigarh (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback