AITopics | Inner Mongolia

Collaborating Authors

Inner Mongolia

MSCMHMST: A traffic flow prediction model based on Transformer

Geng, Weiyang, Pan, Yiming, Xing, Zhecong, Liu, Dongyu, Liu, Rui, Zhu, Yuan

arXiv.org Artificial IntelligenceMar-15-2025

This study proposes a hybrid model based on Transformers, named MSCMHMST, aimed at addressing key challenges in traffic flow prediction. Traditional single-method approaches show limitations in traffic prediction tasks, whereas hybrid methods, by integrating the strengths of different models, can provide more accurate and robust predictions. The MSCMHMST model introduces a multi-head, multi-scale attention mechanism, allowing the model to parallel process different parts of the data and learn its intrinsic representations from multiple perspectives, thereby enhancing the model's ability to handle complex situations. This mechanism enables the model to capture features at various scales effectively, understanding both short-term changes and long-term trends. Verified through experiments on the PeMS04/08 dataset with specific experimental settings, the MSCMHMST model demonstrated excellent robustness and accuracy in long, medium, and short-term traffic flow predictions. The results indicate that this model has significant potential, offering a new and effective solution for the field of traffic flow prediction.

artificial intelligence, machine learning, transformer, (16 more...)

arXiv.org Artificial Intelligence

2503.1354

Country: Asia > China > Inner Mongolia (0.17)

Genre: Research Report (0.82)

Industry:

Consumer Products & Services > Travel (0.77)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Add feedback

Perception-Guided EEG Analysis: A Deep Learning Approach Inspired by Level of Detail (LOD) Theory

Tong, BG

arXiv.org Artificial IntelligenceFeb-8-2025

Objective: This study aims to explore a novel deep learning approach for analyzing electroencephalogram (EEG) data and guiding human perceptual states, inspired by the Level of Detail (LOD) theory. The core objective is to improve the accuracy of identifying perceptual states from EEG signals and to provide new avenues for personalized psychological therapy. Methods: The research employs portable EEG devices to collect data, combined with music rhythm signals for analysis. We introduce the LOD theory to dynamically adjust the processing levels of EEG signals, extracting core features related to perception. The software system is developed using the Unity engine, integrating audio materials and MIDI structures, and enabling the integration of EEG data with Unity. The deep learning model includes a Convolutional Neural Network (CNN) for feature extraction and classification, and a Deep Q-Network (DQN) for reinforcement learning to optimize music rhythm adjustment strategies. Results: The CNN model achieved a 94.05% accuracy in the perceptual state classification task, demonstrating excellent classification performance. The DQN model successfully guided subjects' EEG signals to the target perceptual state with a 92.45% success rate on the validation set, requiring an average of 13.2 rhythm cycles to complete the state guidance. However, subjective feedback from users indicated that approximately 50% of the researchers experienced psychological sensations corresponding to the target state during the rhythm adjustment process, suggesting room for improvement in the system's effectiveness.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2501.10428

Country: Asia > China > Inner Mongolia (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

Liu, Rui, Jia, Zhenqi, Bao, Feilong, Li, Haizhou

arXiv.org Artificial IntelligenceJan-11-2025

Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. Note that this knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in terms of both semantic and style. First, we build a stored dialogue semantic-style database (SDSSD) which includes the text and audio samples. Then, we design a multi-attribute retrieval scheme to match the dialogue semantic and style vectors of the CD with the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we propose adopting the multi-granularity graph structure to encode the dialogue and introducing a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge are fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted a comprehensive and in-depth experiment based on the DailyTalk dataset, which is a benchmarking dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: https://github.com/Coder-jzq/RADKA-CSS.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2501.06467

Country: Asia > China > Inner Mongolia (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.68)

Add feedback

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

Zhao, Yuan, Liu, Rui, Cong, Gaoxiang

arXiv.org Artificial IntelligenceDec-31-2024

Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at \textcolor[rgb]{0.93,0.0,0.47}{https://github.com/AI-S2-Lab/M2CI-Dubber}.

artificial intelligence, current sentence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2412.18748

Country: Asia > China > Inner Mongolia (0.14)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech (0.97)
Information Technology > Artificial Intelligence > Vision (0.70)

Add feedback

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Jia, Zhenqi, Liu, Rui

arXiv.org Artificial IntelligenceDec-24-2024

Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that text and speech modalities in MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take MDH and adopt trained interaction modules to fully infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2412.18733

Country: Asia > China > Inner Mongolia (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.87)

Add feedback

Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

He, Shuwei, Liu, Rui

arXiv.org Artificial IntelligenceDec-23-2024

Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics. To address these issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS2KU-VTTS. Specifically, we first prioritize RGB image as the dominant source and consider depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. Afterwards, we propose a serial interaction mechanism to effectively integrate both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on the respective contributions of each source.This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive speech experience. Experimental results demonstrate that the MS$^2$KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/AI-S2-Lab/MS2KU-VTTS.

artificial intelligence, knowledge, spatial reasoning, (18 more...)

arXiv.org Artificial Intelligence

2410.14101

Country: Asia > China > Inner Mongolia (0.14)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)

Add feedback

Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Dong, Xiaochong, Dan, Jun, Sun, Yingyun, Liu, Yang, Zhang, Xuemin, Mei, Shengwei

arXiv.org Artificial IntelligenceDec-15-2024

Driven by global climate change and the ongoing energy transition, the coupling between power supply capabilities and meteorological factors has become increasingly significant. Over the long term, accurately quantifying the power generation capacity of renewable energy under the influence of climate change is essential for the development of sustainable power systems. However, due to interdisciplinary differences in data requirements, climate data often lacks the necessary hourly resolution to capture the short-term variability and uncertainties of renewable energy resources. To address this limitation, a super-resolution recurrent diffusion model (SRDM) has been developed to enhance the temporal resolution of climate data and model the short-term uncertainty. The SRDM incorporates a pre-trained decoder and a denoising network, that generates long-term, high-resolution climate data through a recurrent coupling mechanism. The high-resolution climate data is then converted into power value using the mechanism model, enabling the simulation of wind and photovoltaic (PV) power generation capacity on future long-term scales. Case studies were conducted in the Ejina region of Inner Mongolia, China, using fifth-generation reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM outperforms existing generative models in generating super-resolution climate data. For the Ejina region, under a high-emission pathway, the annual utilization hours of wind power are projected to decrease by 2.82 hours/year, while those for PV power are projected to decrease by 0.26 hours/year. Furthermore, the research highlights the estimation biases introduced when low-resolution climate data is used for power conversion.

artificial intelligence, climate data, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2412.11399

Country: Asia > China > Inner Mongolia (0.25)

Genre: Research Report > New Finding (0.66)

Industry:

Energy > Renewable > Solar (1.00)
Energy > Renewable > Wind (0.91)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation

Wang, Weihua, Liang, Qiuyu, Bao, Feilong, Gao, Guanglai

arXiv.org Artificial IntelligenceDec-12-2024

Quaternion contains one real part and three imaginary parts, which provided a more expressive hypercomplex space for learning knowledge graph. Existing quaternion embedding models measure the plausibility of a triplet either through semantic matching or geometric distance scoring functions. However, it appears that semantic matching diminishes the separability of entities, while the distance scoring function weakens the semantics of entities. To address this issue, we propose a novel quaternion knowledge graph embedding model. Our model combines semantic matching with entity's geometric distance to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on head entity and a reverse rotation on tail entity to learn rich semantic features. Then, we utilize distance adaptive translations to learn geometric distance between entities. Furthermore, we provide mathematical proofs to demonstrate our model can handle complex logical relationships. Extensive experimental results and analyses show our model significantly outperforms previous models on well-known knowledge graph completion benchmark datasets. Our code is available at https://github.com/llqy123/DaBR.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.04076

Country: Asia > China > Inner Mongolia (0.14)

Genre:

Research Report (0.50)
Personal > Honors (0.46)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency

Liu, Rui, Xi, Jiatian, Jiang, Ziyue, Li, Haizhou

arXiv.org Artificial IntelligenceDec-8-2024

Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance. However, the generated speech in the edited region should maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluency speech editing scheme based on our previous \textit{FluentEditor} model, termed \textit{\textbf{FluentEditor2}}, by modeling the multi-scale acoustic and prosody consistency training criterion in TSE training. Specifically, for local acoustic consistency, we propose \textit{hierarchical local acoustic smoothness constraint} to align the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose \textit{contrastive global prosody consistency constraint} to keep the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that \textit{FluentEditor2} surpasses existing neural networks-based TSE methods, including Editspeech, Campnet, A$^3$T, FluentSpeech, and our Fluenteditor, in both subjective and objective. Ablation studies further highlight the contributions of each module to the overall effectiveness of the system. Speech demos are available at: \url{https://github.com/Ai-S2-Lab/FluentEditor2}.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.03719

Country:

North America (0.67)
Europe > Germany > Bremen > Bremen (0.14)
Asia > China > Inner Mongolia (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs

Zhang, Mengyuan, Wang, Ruihui, Xia, Bo, Sun, Yuan, Zhao, Xiaobing

arXiv.org Artificial IntelligenceNov-14-2024

Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. This paper addresses these challenges by categorizing capabilities into language abilities (syntax and semantics) and cognitive abilities (knowledge and reasoning). To systematically evaluate these areas, we developed MM-Eval, a specialized dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP and MGSM datasets. Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat, Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models performed better on syntactic tasks than semantic tasks, highlighting a gap in deeper language understanding; and 2) knowledge tasks showed a moderate decline, suggesting that models can transfer general knowledge from high-resource to low-resource contexts. The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge, and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian. The dataset is available at https://github.com/joenahm/MM-Eval.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2411.09492

Country:

Europe > Austria > Vienna (0.15)
Asia > China > Inner Mongolia (0.14)

Genre: Research Report (0.64)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback