Wakaki, Hiromi
VinaBench: Benchmark for Faithful and Consistent Visual Narratives
Gao, Silin, Mathew, Sheryl, Mi, Li, Mamooler, Sepideh, Zhao, Mengjie, Wakaki, Hiromi, Mitsufuji, Yuki, Montariol, Syrielle, Bosselut, Antoine
Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
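For illustration only, the sketch below shows one plausible way to represent a constraint-annotated visual-narrative sample and a toy cross-image consistency check. The field names and the metric are assumptions for exposition, not the official VinaBench schema or evaluation code.

```python
# Hypothetical sketch (not the official VinaBench schema): a visual-narrative sample
# annotated with commonsense and discourse constraints, plus a toy consistency check
# that asks whether entity attributes persist across consecutive panels.
from dataclasses import dataclass, field

@dataclass
class Panel:
    caption: str                                   # text segment illustrated by this image
    entities: dict = field(default_factory=dict)   # entity -> {attribute: value}

@dataclass
class NarrativeSample:
    panels: list
    commonsense_links: list    # e.g. ("knight", "carries", "sword")
    discourse_links: list      # e.g. (0, 1, "temporal_succession")

def attribute_consistency(sample: NarrativeSample) -> float:
    """Fraction of entity attributes that stay unchanged across consecutive panels."""
    kept, total = 0, 0
    for prev, curr in zip(sample.panels, sample.panels[1:]):
        for ent, attrs in prev.entities.items():
            for attr, value in attrs.items():
                if ent in curr.entities and attr in curr.entities[ent]:
                    total += 1
                    kept += int(curr.entities[ent][attr] == value)
    return kept / total if total else 1.0
```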
Cross-Modal Learning for Music-to-Music-Video Description Generation
Mao, Zhuoyuan, Zhao, Mengjie, Wu, Qiyu, Zhong, Zhi, Liao, Wei-Hsiang, Wakaki, Hiromi, Mitsufuji, Yuki
Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV descriptions directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV descriptions and highlight specific musical attributes that warrant greater focus for improved MV description generation.
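As a rough illustration of the fine-tuning setup described above (mapping music representations into a text model's input space), here is a minimal sketch. The dimensions, prefix length, and module name are assumptions, not the paper's implementation.

```python
# Minimal sketch, assuming pairs of (music embedding, MV description) built from
# Music4All; the module name and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn

class MusicToTextPrefix(nn.Module):
    """Project a music embedding into a short prefix of LLM token embeddings."""
    def __init__(self, music_dim=512, llm_dim=4096, prefix_len=8):
        super().__init__()
        self.prefix_len, self.llm_dim = prefix_len, llm_dim
        self.proj = nn.Sequential(nn.Linear(music_dim, llm_dim * prefix_len), nn.GELU())

    def forward(self, music_emb):                   # (B, music_dim)
        prefix = self.proj(music_emb)               # (B, prefix_len * llm_dim)
        return prefix.view(-1, self.prefix_len, self.llm_dim)

# Training step (schematic): prepend the projected prefix to the embedded description
# tokens and minimize next-token cross-entropy with the (possibly frozen) text LLM.
prefix = MusicToTextPrefix()(torch.randn(2, 512))   # (2, 8, 4096) soft tokens
```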
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
Mao, Zhuoyuan, Zhao, Mengjie, Wu, Qiyu, Wakaki, Hiromi, Mitsufuji, Yuki
Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements have primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos, and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-alignment Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performance across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets.
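A hedged sketch of the fusion idea mentioned above: multiple ImageBind-style embeddings sampled per modality pass through a small pre-alignment Transformer before being handed to a text LLM as soft tokens. Dimensions, layer counts, and names are illustrative assumptions.

```python
# Illustrative pre-alignment fusion module (assumed shapes; not the DeepResonance code).
import torch
import torch.nn as nn

class PreAlignmentFusion(nn.Module):
    def __init__(self, embed_dim=1024, llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.to_llm = nn.Linear(embed_dim, llm_dim)

    def forward(self, multimodal_tokens):      # (B, N, embed_dim): music/image/video samples
        aligned = self.encoder(multimodal_tokens)
        return self.to_llm(aligned)            # (B, N, llm_dim) soft tokens for the text LLM

fusion = PreAlignmentFusion()
tokens = torch.randn(2, 12, 1024)              # e.g., 4 sampled embeddings x 3 modalities
soft_prompt = fusion(tokens)                   # concatenated with the instruction embeddings
```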
TED: Turn Emphasis with Dialogue Feature Attention for Emotion Recognition in Conversation
Ono, Junya, Wakaki, Hiromi
Emotion recognition in conversation (ERC) has been attracting attention, with a focus on methods for modeling multi-turn contexts. Feeding multi-turn input to a pretrained model implicitly assumes that the current turn is distinguished from the other turns during training only by special tokens inserted into the input sequence. This paper proposes a priority-based attention method, Turn Emphasis with Dialogue (TED), that distinguishes each turn explicitly by adding dialogue features to the attention mechanism. TED assigns each turn a priority based on turn position and speaker information, used as dialogue features. It applies multi-head self-attention over turn-based vectors for the multi-turn input and adjusts the attention scores with the dialogue features. We evaluate TED on four typical benchmarks. The experimental results demonstrate that TED achieves high overall performance on all datasets and state-of-the-art performance on IEMOCAP, which contains numerous turns per dialogue.
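The following is an illustrative, single-head simplification (not the authors' implementation) of how attention logits over per-turn vectors could be biased with turn-position and speaker features.

```python
# Sketch of priority-biased attention over turn vectors; alpha/beta and the priority
# formula are assumptions chosen for illustration.
import torch
import torch.nn.functional as F

def turn_emphasis_attention(turn_vecs, turn_pos, speakers, current_idx, alpha=1.0, beta=1.0):
    """turn_vecs: (T, d), one vector per turn; returns re-weighted turn representations."""
    d = turn_vecs.size(-1)
    scores = turn_vecs @ turn_vecs.t() / d ** 0.5           # (T, T) dot-product logits
    # Priority: turns close to the current turn and sharing its speaker get a boost.
    distance = (turn_pos - turn_pos[current_idx]).abs().float()
    same_speaker = (speakers == speakers[current_idx]).float()
    priority = -alpha * distance + beta * same_speaker      # (T,)
    scores = scores + priority.unsqueeze(0)                 # bias every query's key logits
    return F.softmax(scores, dim=-1) @ turn_vecs            # (T, d)

# e.g. turn_emphasis_attention(torch.randn(5, 64), torch.arange(5),
#                              torch.tensor([0, 1, 0, 1, 0]), current_idx=4)
```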
OpenMU: Your Swiss Army Knife for Music Understanding
Zhao, Mengjie, Zhong, Zhi, Mao, Zhuoyuan, Yang, Shiqi, Liao, Wei-Hsiang, Takahashi, Shusuke, Wakaki, Hiromi, Mitsufuji, Yuki
We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.
Distillation of Discrete Diffusion through Dimensional Correlations
Hayakawa, Satoshi, Takida, Yuhta, Imaizumi, Masaaki, Wakaki, Hiromi, Mitsufuji, Yuki
Diffusion models have demonstrated exceptional performance in various fields of generative modeling. While they often outperform competitors, including VAEs and GANs, in sample quality and diversity, they suffer from slow sampling speed due to their iterative nature. Recently, distillation techniques and consistency models have mitigated this issue in continuous domains, but discrete diffusion models face specific challenges on the path to faster generation. Most notably, in the current literature, correlations between different dimensions (pixels, locations) are ignored, in both the modeling and the loss functions, due to computational limitations. In this paper, we propose "mixture" models for discrete diffusion that can treat dimensional correlations while remaining scalable, and we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: first, dimensionally independent models can approximate the data distribution well if they are allowed to conduct many sampling steps, and second, our loss functions enable mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. We empirically demonstrate that our proposed method for discrete diffusion works in practice by distilling a continuous-time discrete diffusion model pretrained on the CIFAR-10 dataset.
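For intuition, here is one way to write the contrast between a dimensionally independent denoiser and a mixture of factorized components; the notation is an illustration, not necessarily the paper's exact parameterization.

```latex
% A dimensionally independent (factorized) denoiser cannot express within-step
% correlations between dimensions:
\[
p_\theta(x_0 \mid x_t) \;=\; \prod_{d=1}^{D} p_\theta\!\left(x_0^{d} \mid x_t\right).
\]
% A mixture of such factorized components keeps each component cheap to evaluate,
% while the latent index k couples the dimensions, which is what allows a few-step
% student to imitate a many-step teacher:
\[
p_\theta(x_0 \mid x_t) \;=\; \sum_{k=1}^{K} w_k(x_t) \prod_{d=1}^{D} p_\theta\!\left(x_0^{d} \mid x_t, k\right).
\]
```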
ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark
Wakaki, Hiromi, Mitsufuji, Yuki, Maeda, Yoshinori, Nishimura, Yukiko, Gao, Silin, Zhao, Mengjie, Yamada, Keiichi, Bosselut, Antoine
We propose a new benchmark, ComperDial, which facilitates the training and evaluation of evaluation metrics for open-domain dialogue systems. ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents submitted to the Commonsense Persona-grounded Dialogue (CPD) challenge. As a result, for any dialogue, our benchmark includes multiple diverse responses with a variety of characteristics to ensure more robust evaluation of learned dialogue metrics. In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores, enabling joint assessment of multi-turn model responses throughout a dialogue. Finally, building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations. Our experimental results demonstrate that our novel metric, CPDScore, is more correlated with human judgments than existing metrics. We release both ComperDial and CPDScore to the community to accelerate development of automatic evaluation metrics for open-domain dialogue systems.
Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning
Xie, Zhouhang, Majumder, Bodhisattwa Prasad, Zhao, Mengjie, Maeda, Yoshinori, Yamada, Keiichi, Wakaki, Hiromi, McAuley, Julian
We consider the task of building a dialogue system that can motivate users to adopt positive lifestyle changes: Motivational Interviewing. Addressing such a task requires a system that can infer how to motivate a user effectively. We propose DIIR, a framework that is capable of learning and applying conversation strategies in the form of natural language inductive rules from expert demonstrations. Automatic and human evaluations of instruction-following large language models show that the natural language strategy descriptions discovered by DIIR can improve active listening skills, reduce unsolicited advice, and promote more collaborative and less authoritative responses, outperforming various demonstration utilization methods.
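A schematic sketch of the induce-then-apply recipe described above: rules are first induced from expert demonstrations, then prepended at inference time. Here `llm` is a placeholder for any instruction-following chat model call, and the prompts are invented for illustration.

```python
# Schematic induce-then-apply loop; `llm(prompt) -> str` is a placeholder callable,
# not an API from the paper.
def induce_rules(llm, demonstrations):
    """Ask the model to explain each expert response as a reusable strategy rule."""
    rules = []
    for context, expert_response in demonstrations:
        prompt = (
            "Dialogue context:\n" + context +
            "\nExpert counselor response:\n" + expert_response +
            "\nWrite one general strategy rule (one sentence) describing what the "
            "expert did and when to do it."
        )
        rules.append(llm(prompt))
    return rules

def respond_with_rules(llm, rules, new_context):
    """Condition generation on the induced natural-language strategy descriptions."""
    prompt = (
        "Follow these motivational-interviewing strategies:\n- " + "\n- ".join(rules) +
        "\n\nDialogue context:\n" + new_context + "\nCounselor response:"
    )
    return llm(prompt)
```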
DiffuCOMET: Contextual Commonsense Knowledge Diffusion
Gao, Silin, Ismayilzada, Mete, Zhao, Mengjie, Wakaki, Hiromi, Mitsufuji, Yuki, Bosselut, Antoine
Inferring contextually-relevant and diverse commonsense to understand narratives remains challenging for knowledge models. In this work, we develop a series of knowledge models, DiffuCOMET, that leverage diffusion to learn to reconstruct the implicit semantic connections between narrative contexts and relevant commonsense knowledge. Across multiple diffusion steps, our method progressively refines a representation of commonsense facts that is anchored to a narrative, producing contextually-relevant and diverse commonsense inferences for an input context. To evaluate DiffuCOMET, we introduce new metrics for commonsense inference that more closely measure knowledge diversity and contextual relevance. Our results on two different benchmarks, ComFact and WebNLG+, show that knowledge generated by DiffuCOMET achieves a better trade-off between commonsense diversity, contextual relevance and alignment to known gold references, compared to baseline knowledge models.
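A hedged sketch of the progressive-refinement idea: noisy fact embeddings are denoised over a few steps, conditioned on an encoded narrative context, and would then be decoded into facts. The module, dimensions, and step schedule are illustrative assumptions, not the DiffuCOMET architecture.

```python
# Illustrative context-conditioned denoiser over fact embeddings (assumed design).
import torch
import torch.nn as nn

class ContextConditionedDenoiser(nn.Module):
    def __init__(self, dim=768, heads=8, layers=2, max_steps=1000):
        super().__init__()
        block = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, layers)
        self.time_embed = nn.Embedding(max_steps, dim)

    def forward(self, noisy_facts, context, t):
        # noisy_facts: (B, F, dim) fact embeddings; context: (B, L, dim) narrative encoding
        h = noisy_facts + self.time_embed(t)[:, None, :]
        return self.blocks(h, memory=context)      # refined fact embeddings

denoiser = ContextConditionedDenoiser()
facts = torch.randn(1, 16, 768)                    # start from noise
context = torch.randn(1, 128, 768)                 # encoded narrative context
for t in reversed(range(0, 1000, 250)):            # a few coarse refinement steps
    facts = denoiser(facts, context, torch.tensor([t]))
```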
Using Natural Language Inference to Improve Persona Extraction from Dialogue in a New Domain
DeLucia, Alexandra, Zhao, Mengjie, Maeda, Yoshinori, Yoda, Makoto, Yamada, Keiichi, Wakaki, Hiromi
While valuable datasets such as PersonaChat provide a foundation for training persona-grounded dialogue agents, they lack diversity in conversational and narrative settings, primarily existing in the "real" world. To develop dialogue agents with unique personas, models are trained to converse given a specific persona, but hand-crafting these personas can be time-consuming, so methods exist to automatically extract persona information from existing character-specific dialogue. However, these persona-extraction models are also trained on datasets derived from PersonaChat and struggle to provide high-quality persona information for conversational settings that do not take place in the real world, such as the fantasy-focused dataset LIGHT. Creating new data to train models on a specific setting is human-intensive and thus prohibitively expensive. To address both of these issues, we introduce a natural language inference method for post-hoc adapting a trained persona extraction model to a new setting. We draw inspiration from the literature on dialogue natural language inference (NLI) and devise NLI-reranking methods to extract structured persona information from dialogue. Compared to existing persona extraction models, our method returns higher-quality extracted personas and requires less human annotation.
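As a sketch of the general NLI-reranking recipe (not the paper's exact model choice or thresholds), an off-the-shelf NLI model can score how strongly a character's utterance entails each candidate persona statement and keep the top-scoring candidates.

```python
# NLI-based reranking sketch using a publicly available NLI checkpoint; the model
# name and top_k are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()
ENTAILMENT = nli.config.label2id.get("ENTAILMENT", 2)

def rerank_personas(utterance, candidates, top_k=3):
    """Score (utterance entails candidate persona) and return the best candidates."""
    batch = tok([utterance] * len(candidates), candidates,
                return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        probs = nli(**batch).logits.softmax(-1)[:, ENTAILMENT]
    return sorted(zip(candidates, probs.tolist()), key=lambda x: -x[1])[:top_k]

# e.g. rerank_personas("I forged this blade in the dragon's fire.",
#                      ["I am a blacksmith.", "I fear dragons.", "I live in a castle."])
```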