
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation (Supplementary Materials)

Neural Information Processing Systems

Figure 1 shows a diagram of the training scheme for the cross-modal retrieval module. Each multiple-choice candidate consists of the correct vision+audio fusion embedding along with a pose embedding.

Experimental results when one of the modalities is masked:

Type of Masking                                          SDR    SIR    SAR
Masking applied to the visual modality                    7.82  14.39  10.65
Masking applied to the pose modality                     12.06  18.34  14.17
15% random masking for both visual and pose modalities   12.34  18.76  14.37

In this paper, sound separation is our primary task; therefore, we do not consider masking for the audio modality.
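The supplementary text does not spell out how the 15% random masking is implemented; the following is a minimal illustrative sketch (function and variable names are hypothetical, not from the paper) of randomly masking a fraction of token embeddings in a modality stream:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_value=0.0, rng=None):
    """Randomly replace a fraction of token embeddings with a mask value.

    tokens: list of embedding vectors (lists of floats) for one modality.
    mask_rate: probability of masking each token independently.
    """
    rng = rng or random.Random(0)
    masked = []
    for tok in tokens:
        if rng.random() < mask_rate:
            # Replace the whole embedding with the mask value.
            masked.append([mask_value] * len(tok))
        else:
            masked.append(tok)
    return masked

# Example: mask a small set of (hypothetical) visual token embeddings.
visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked_visual = mask_tokens(visual_tokens, mask_rate=0.15)
```

The same routine would be applied independently to the pose stream; per the text above, the audio stream is never masked since sound separation is the primary task.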



TriBERT: Human-centric Audio-visual Representation Learning

Neural Information Processing Systems

The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored their use for audio-visual modalities, and none, to our knowledge, have demonstrated them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, through the use of flexible co-attention. The use of pose keypoints is inspired by recent works illustrating that such representations can significantly boost performance in many audio-visual scenarios where one or more persons are responsible for the sound, either explicitly (e.g., talking) or implicitly (e.g., sound produced by a human manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak supervision to allow granular cross-modal interactions for the visual and pose modalities. Further, we supplement learning with a sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset, as well as on other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks, such as cross-modal audio-visual-pose retrieval, by as much as 66.7% in top-1 accuracy.
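The co-attention idea at the core of ViLBERT-style models -- queries from one modality attending over keys and values from another -- can be sketched minimally as follows. This is a single-head, unparameterized illustration under assumed token shapes, not the TriBERT implementation (which uses learned projections, multiple heads, and three streams):

```python
import numpy as np

def co_attention(q_feats, kv_feats):
    """Single-head cross-modal attention sketch.

    q_feats:  (Nq, D) token features from the querying modality.
    kv_feats: (Nkv, D) token features from the attended modality.
    Returns (Nq, D) features where each query token is a softmax-weighted
    mixture of the other modality's tokens.
    """
    d_k = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)          # (Nq, Nkv) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ kv_feats

# Hypothetical token features: 4 visual tokens attending to 6 audio tokens.
rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 8))
audio = rng.normal(size=(6, 8))
vision_attended = co_attention(vision, audio)
```

In a tri-modal setting, each stream would run such a block against each of the other two streams (e.g., vision attending to audio and to pose), with learned query/key/value projections in place of the raw features used here.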



Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education

Zeng, Zijie, Sha, Lele, Li, Yuheng, Yang, Kaixun, Gašević, Dragan, Chen, Guanliang

arXiv.org Artificial Intelligence

Recent large language models (LLMs), e.g., ChatGPT, can generate human-like and fluent responses when provided with specific instructions. While acknowledging the convenience brought by this technological advancement, educators are also concerned that students might leverage LLMs to complete their writing assignments and pass them off as their own original work. Although many AI content detection studies have been conducted in response to such concerns, most of these prior studies modeled AI content detection as a classification problem, assuming that a text is either entirely human-written or entirely AI-generated. In this study, we investigated AI content detection in a rarely explored yet realistic setting where the text to be detected is written collaboratively by a human and generative LLMs (i.e., hybrid text). We first formalized the detection task as identifying the transition points between human-written content and AI-generated content in a given hybrid text (boundary detection). We then proposed a two-step approach in which we (1) separated AI-generated content from human-written content during the encoder training process; and (2) calculated the distances between every two adjacent prototypes, assuming that the boundaries lie between the adjacent prototypes that are furthest from each other. Through extensive experiments, we observed the following main findings: (1) the proposed approach consistently outperformed the baseline methods across different experiment settings; (2) the encoder training process can significantly boost the performance of the proposed approach; (3) when detecting boundaries for single-boundary hybrid essays, the proposed approach could be enhanced by adopting a relatively large prototype size, leading to a 22% improvement in the In-Domain evaluation and an 18% improvement in the Out-of-Domain evaluation.
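Step (2) of the approach above -- placing the boundary between the adjacent prototype pair with the largest distance -- can be sketched as follows. This is an illustrative reconstruction under assumed inputs (per-segment prototype vectors in reading order); the actual prototype construction comes from the trained encoder, which is not shown here:

```python
import numpy as np

def predict_boundary(prototypes):
    """Return the index i such that the predicted boundary lies between
    prototypes i and i+1 -- the adjacent pair with the largest Euclidean
    distance -- along with the full list of adjacent distances.
    """
    protos = np.asarray(prototypes, dtype=float)
    # Distances between each consecutive pair of prototypes.
    dists = np.linalg.norm(protos[1:] - protos[:-1], axis=1)
    return int(np.argmax(dists)), dists

# Hypothetical 2-D prototypes for four consecutive text segments; the
# representation "jumps" between segments 1 and 2, suggesting an
# authorship transition there.
protos = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 2.0]]
boundary_idx, dists = predict_boundary(protos)  # boundary between segments 1 and 2
```

For multi-boundary essays, one would instead take the top-k adjacent distances rather than the single argmax.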