
Joint Modeling


Joint Modeling of Visual Objects and Relations for Scene Graph Generation (Supplementary Material)

Neural Information Processing Systems

Now we can exactly derive that q(G) = p̂(G|I). The definitions of the potential functions φ and ψ follow those in the JM-SGG model.

Figure 1 (caption): Scene graphs generated by the JM-SGG model. In these examples, the factor update is able to correct some wrong relation labels (e.g. …).
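The excerpt presupposes notation defined in the main paper. As a hedged reconstruction in LaTeX (the symbols o_i for object labels, r_ij for relation labels, and Z(I) for the partition function are our notation, inferred from the abstract below rather than quoted from the supplement):

```latex
% Unified CRF over a scene graph G conditioned on image I, with object
% potentials \phi and relation potentials \psi, plus the fully factorized
% mean-field family q used for variational inference.
\[
  p(G \mid I) = \frac{1}{Z(I)} \prod_{i} \phi(o_i \mid I)
  \prod_{i \neq j} \psi(r_{ij}, o_i, o_j \mid I),
  \qquad
  q(G) = \prod_{i} q_i(o_i) \prod_{i \neq j} q_{ij}(r_{ij}).
\]
```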


Joint Modeling of Visual Objects and Relations for Scene Graph Generation

Neural Information Processing Systems

An in-depth scene understanding usually requires recognizing all the objects and their relations in an image, encoded as a scene graph. Most existing approaches for scene graph generation first independently recognize each object and then predict their relations independently. Though these approaches are very efficient, they ignore the dependency between different objects as well as between their relations. In this paper, we propose a principled approach to jointly predict the entire scene graph by fully capturing the dependency between different objects and between their relations. Specifically, we establish a unified conditional random field (CRF) to model the joint distribution of all the objects and their relations in a scene graph. We carefully design the potential functions to enable relational reasoning among different objects according to knowledge graph embedding methods. We further propose an efficient and effective algorithm for inference based on mean-field variational inference, in which we first provide a warm initialization by independently predicting the objects and their relations according to the current model, followed by a few iterations of relational reasoning. Experimental results on both the relationship retrieval and zero-shot relationship retrieval tasks prove the efficiency and efficacy of our proposed approach.
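To make the inference procedure concrete, here is a minimal sketch of the scheme the abstract describes: warm-start with the unary (independent) predictions, then run a few mean-field updates that pass messages between object and relation marginals. The tensor layout and the single shared pair potential are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mean_field_inference(obj_logits, rel_logits, pair_potential, n_iters=3):
    """Hypothetical mean-field loop for a scene-graph CRF.

    obj_logits:     (N, C_obj) unary object scores from the detector
    rel_logits:     (N, N, C_rel) unary relation scores per object pair
    pair_potential: (C_obj, C_obj, C_rel) log-potential psi(o_i, o_j, r_ij);
                    a stand-in for the paper's learned potential functions
    """
    q_obj = F.softmax(obj_logits, dim=-1)   # warm initialization:
    q_rel = F.softmax(rel_logits, dim=-1)   # independent predictions
    for _ in range(n_iters):
        # Relation update: expected pair potential under the object marginals.
        msg_rel = torch.einsum('ia,jb,abr->ijr', q_obj, q_obj, pair_potential)
        q_rel = F.softmax(rel_logits + msg_rel, dim=-1)
        # Object update: messages from factors where i is subject or object.
        # (For brevity this sums over all ordered pairs, including i == j.)
        msg_out = torch.einsum('jb,ijr,abr->ia', q_obj, q_rel, pair_potential)
        msg_in = torch.einsum('jb,jir,bar->ia', q_obj, q_rel, pair_potential)
        q_obj = F.softmax(obj_logits + msg_out + msg_in, dim=-1)
    return q_obj, q_rel
```

Reading out the argmax of q_obj and q_rel after a few iterations yields the jointly refined scene graph.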


DySTAN: Joint Modeling of Sedentary Activity and Social Context from Smartphone Sensors

Sneh, Aditya, Sahu, Nilesh Kumar, Gupta, Snehil, Lone, Haroon R.

arXiv.org Artificial Intelligence

Accurately recognizing human context from smartphone sensor data remains a significant challenge, especially in sedentary settings where activities such as studying, attending lectures, relaxing, and eating exhibit highly similar inertial patterns. Furthermore, social context plays a critical role in understanding user behavior, yet is often overlooked in mobile sensing research. To address these gaps, we introduce LogMe, a mobile sensing application that passively collects smartphone sensor data (accelerometer, gyroscope, magnetometer, and rotation vector) and prompts users for hourly self-reports capturing both sedentary activity and social context. Using this dual-label dataset, we propose DySTAN (Dynamic Cross-Stitch with Task Attention Network), a multi-task learning framework that jointly classifies both context dimensions from shared sensor inputs. It integrates task-specific layers with cross-task attention to model subtle distinctions effectively. DySTAN improves sedentary activity macro F1 scores by 21.8% over a single-task CNN-BiLSTM-GRU (CBG) model and by 8.2% over the strongest multi-task baseline, Sluice Network (SN). These results demonstrate the importance of modeling multiple, co-occurring context dimensions to improve the accuracy and robustness of mobile context recognition.
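The cross-stitch-with-attention idea can be sketched briefly. The classic cross-stitch unit (Misra et al., 2016) lets each task's features become a learned mix of both tasks' features; DySTAN's "dynamic" variant presumably conditions the mixing on the input, but its exact form is not given here, so the static baseline below is only an illustrative sketch, not the authors' code.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Static cross-stitch unit: a learnable 2x2 mix of the two task
    streams (sedentary activity vs. social context), initialized near
    the identity so each task starts mostly with its own features."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, feat_activity, feat_social):
        mixed_activity = self.alpha[0, 0] * feat_activity \
                       + self.alpha[0, 1] * feat_social
        mixed_social = self.alpha[1, 0] * feat_activity \
                     + self.alpha[1, 1] * feat_social
        return mixed_activity, mixed_social
```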


Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling

Gu, Yue, Du, Zhihao, Shi, Ying, Zhang, Shiliang, Chen, Qian, Han, Jiqing

arXiv.org Artificial Intelligence

Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advancements in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in biasing information volume, especially when the length of the biasing list increases significantly. We find that, regardless of the length of the biasing list, only a limited amount of biasing information is most relevant to a specific ASR intermediate representation. Therefore, by identifying and integrating the most relevant biasing information rather than the entire biasing list, we can alleviate the effects of variations in biasing information volume for contextual ASR. To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. In PSC-Joint, we define and calculate three semantic correlations between the ASR intermediate representations and biasing information, from coarse to fine: list-level, phrase-level, and token-level. The three correlations are then jointly modeled to produce their intersection, so that the most relevant biasing information across the granularities is highlighted and integrated for contextual recognition. In addition, to reduce the computational cost introduced by the joint modeling of three semantic correlations, we also propose a purification mechanism based on a grouped-and-competitive strategy to filter out irrelevant biasing phrases. Compared with baselines, our PSC-Joint approach achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech across biasing lists of varying lengths. In recent years, remarkable advancements have been made in end-to-end automatic speech recognition (E2E ASR), such as connectionist temporal classification [1], the recurrent neural network transducer [2], [3], and attention-based encoder-decoders [4]-[7].
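As a concrete (and deliberately simplified) reading of the coarse-to-fine idea, the sketch below computes one correlation per granularity and multiplies them, so only biasing phrases that look relevant at every level survive. All shapes, pooling choices, and the use of an elementwise product for the "intersection" are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_correlation(asr_repr, token_emb, phrase_mask):
    """Sketch of coarse-to-fine correlation between ASR frames and biasing
    phrases. Assumed shapes:
      asr_repr:    (T, d)     ASR intermediate representations
      token_emb:   (P, L, d)  token embeddings of P biasing phrases
      phrase_mask: (P, L)     1.0 for real tokens, 0.0 for padding
    Returns a (T, P) relevance map combining the three granularities."""
    d = asr_repr.size(-1)
    # Token level: frame-to-token similarity, max-pooled over tokens.
    tok_sim = torch.einsum('td,pld->tpl', asr_repr, token_emb) / d ** 0.5
    tok_sim = tok_sim.masked_fill(phrase_mask[None] == 0, float('-inf'))
    token_corr = tok_sim.max(dim=-1).values                       # (T, P)
    # Phrase level: similarity to mean-pooled phrase embeddings.
    phrase_emb = (token_emb * phrase_mask[..., None]).sum(1) \
        / phrase_mask.sum(1, keepdim=True)                        # (P, d)
    phrase_corr = asr_repr @ phrase_emb.T / d ** 0.5              # (T, P)
    # List level: per-frame gate for "is any biasing phrase relevant?"
    list_corr = torch.sigmoid(phrase_corr.max(-1, keepdim=True).values)
    # "Intersection" of the three granularities via elementwise product.
    return list_corr * F.softmax(phrase_corr, -1) * F.softmax(token_corr, -1)
```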


TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Tseng, Liang-Hsuan, Chen, Yi-Chang, Lee, Kuan-Yi, Shiu, Da-Shan, Lee, Hung-yi

arXiv.org Artificial Intelligence

Recent efforts target spoken language models (SLMs) that not only listen but also speak, for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that achieves this through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by applying Low-Rank Adaptation to the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
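The aggregation step the abstract mentions can be pictured as cross-attention where text tokens are the queries and speech frames are the keys and values, so the output has one "speech token" per text token. This is a minimal sketch of that idea; the layer sizes, names, and the use of a single nn.MultiheadAttention are assumptions, not TASTE's actual architecture.

```python
import torch.nn as nn

class TextAlignedAggregator(nn.Module):
    """Aggregate a long speech frame sequence into one token per text token
    via cross-attention (text queries attend over speech frames)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_emb, speech_frames):
        # text_emb:      (B, N_text, d)   embeddings of the transcription
        # speech_frames: (B, T_speech, d) speech encoder outputs, T >> N_text
        aligned, _ = self.attn(query=text_emb, key=speech_frames,
                               value=speech_frames)
        return aligned  # (B, N_text, d): text-aligned speech tokens
```

Because the output length matches the transcription, the resulting speech tokens can be fused with text tokens in an ordinary decoder-only LLM.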


Joint Modeling of Search and Recommendations Via an Unified Contextual Recommender (UniCoRn)

Bhattacharya, Moumita, Ostuni, Vito, Lamkhede, Sudarshan

arXiv.org Artificial Intelligence

Search and recommendation systems are essential in many services, and they are often developed separately, leading to complex maintenance and technical debt. In this paper, we present a unified deep learning model that efficiently handles key aspects of both tasks.
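One simple way to realize the unified idea is to treat the search query as optional context: when absent (recommendation mode), a learned null embedding stands in for it, and the same scoring network serves both tasks. The sketch below follows that pattern; all feature choices, sizes, and names are illustrative assumptions, not the production UniCoRn model.

```python
import torch
import torch.nn as nn

class UnifiedContextualRanker(nn.Module):
    def __init__(self, n_users, n_items, vocab_size, d=64):
        super().__init__()
        self.user = nn.Embedding(n_users, d)
        self.item = nn.Embedding(n_items, d)
        self.query = nn.EmbeddingBag(vocab_size, d)      # mean-pooled query words
        self.null_query = nn.Parameter(torch.zeros(d))   # "no query" placeholder
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(),
                                 nn.Linear(d, 1))

    def forward(self, user_ids, item_ids, query_tokens=None):
        u, v = self.user(user_ids), self.item(item_ids)
        if query_tokens is None:               # recommendation request
            q = self.null_query.expand(u.size(0), -1)
        else:                                  # search request
            q = self.query(query_tokens)       # (B, L) token ids -> (B, d)
        return self.mlp(torch.cat([u, v, q], dim=-1)).squeeze(-1)
```

Sharing one scoring tower this way removes the duplicated infrastructure behind the maintenance burden the abstract describes.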


Is It Really Useful to Jointly Parse Constituency and Dependency Trees? A Revisit

Gu, Yanggang, Hou, Yang, Wang, Zhefeng, Duan, Xinyu, Li, Zhenghua

arXiv.org Artificial Intelligence

This work revisits the topic of jointly parsing constituency and dependency trees, i.e., producing compatible constituency and dependency trees simultaneously for input sentences, which is attractive given that the two types of trees are complementary in representing syntax. Compared with previous works, we make progress in four aspects: (1) adopting a much more efficient decoding algorithm, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components for constituent-dependency interaction, and (4) gaining more insights via in-depth experiments and analysis.
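The abstract does not spell out its scoring model, but the third contribution suggests a decomposition like the toy sketch below: constituent scores plus dependency-arc scores plus an interaction term tying each constituent to its head word. Everything here (the tuple formats and the dict-based score tables standing in for a neural scorer) is a hypothetical illustration, not the paper's parser.

```python
def joint_score(constituents, arcs, span_score, arc_score, head_score):
    """Score a compatible (constituency, dependency) tree pair.
    constituents: iterable of (i, j, label, head) spans with head word index
    arcs:         iterable of (head, dependent) dependency arcs
    *_score:      dict lookups standing in for learned scoring functions"""
    total = 0.0
    for (i, j, label, head) in constituents:
        total += span_score[(i, j, label)]
        total += head_score[(i, j, head)]  # constituent-dependency interaction
    for (h, d) in arcs:
        total += arc_score[(h, d)]
    return total
```

Joint decoding then searches for the tree pair that maximizes this total subject to a compatibility constraint between the two trees.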


Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

Zhou, Xinyu, Chen, Delong, Chen, Yudong

arXiv.org Artificial Intelligence

This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously, which more closely aligns with the human speech production process compared to the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules. We hypothesize that Large Language Models (LLMs) with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We conduct two sets of experiments: 1) Prosodic structure prediction, a typical front-end task in TTS, demonstrating the speech understanding ability of LLMs, and 2) Further integrating dialogue response and a wide array of linguistic features using a unified encoding format. Our results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.
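As a hedged illustration of what a "unified encoding format" could look like, the snippet below interleaves a dialogue response with prosodic break tags (#1 < #2 < #3, a convention common in TTS front-ends), so a single next-token objective covers both what to say and how to say it. The tag set and template are our assumptions, not the paper's exact scheme.

```python
# Hypothetical training pair for joint response + linguistic-feature modeling.
prompt = (
    "User: How was your weekend?\n"
    "Assistant (answer with prosodic breaks, #1 < #2 < #3):"
)
target = "It was great #2 I went hiking #1 with some friends #3"
# Fine-tuning an LLM on (prompt, target) pairs teaches it to emit the
# response and its prosodic structure in one autoregressive pass.
```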