AITopics | tvlt

Collaborating Authors

tvlt

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

TVLT: TextlessVision-LanguageTransformer

Neural Information Processing SystemsFeb-8-2026, 12:17:46 GMT

Thechallenge liesinthedifference between textand acoustic signals; textisdiscrete and dense ininformation, while acoustic signals are continuous and sparse in information [26; 7]. Therefore, modality-specific architectures have beenusedtomodel datafromdifferent modalities.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

TVLT: Textless Vision-Language Transformer

Neural Information Processing SystemsDec-24-2025, 02:08:03 GMT

name change, textless vision-language transformer, tvlt, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.61)

Add feedback

3ea3134345f2e6228a29f35b86bce24d-Supplemental-Conference.pdf

Neural Information Processing SystemsAug-14-2025, 08:34:01 GMT

artificial intelligence, machine learning, tvlt, (16 more...)

Neural Information Processing Systems

Country: Europe > Serbia (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.70)

Add feedback

3ea3134345f2e6228a29f35b86bce24d-Paper-Conference.pdf

Neural Information Processing SystemsAug-14-2025, 08:33:57 GMT

representation, tvlt, video, (14 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

TVLT: Textless Vision-Language Transformer

Neural Information Processing SystemsOct-10-2024, 18:42:15 GMT

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text. Our code and checkpoints are available at: https://github.com/zinengtang/TVLT

representation, textless vision-language transformer, tvlt, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.65)

Add feedback

We Care: Multimodal Depression Detection and Knowledge Infused Mental Health Therapeutic Response Generation

Moon, Palash, Bhattacharyya, Pushpak

arXiv.org Artificial IntelligenceJun-15-2024

The detection of depression through non-verbal cues has gained significant attention. Previous research predominantly centred on identifying depression within the confines of controlled laboratory environments, often with the supervision of psychologists or counsellors. Unfortunately, datasets generated in such controlled settings may struggle to account for individual behaviours in real-life situations. In response to this limitation, we present the Extended D-vlog dataset, encompassing a collection of 1, 261 YouTube vlogs. Additionally, the emergence of large language models (LLMs) like GPT3.5, and GPT4 has sparked interest in their potential they can act like mental health professionals. Yet, the readiness of these LLM models to be used in real-life settings is still a concern as they can give wrong responses that can harm the users. We introduce a virtual agent serving as an initial contact for mental health patients, offering Cognitive Behavioral Therapy (CBT)-based responses. It comprises two core functions: 1. Identifying depression in individuals, and 2. Delivering CBT-based therapeutic responses. Our Mistral model achieved impressive scores of 70.1% and 30.9% for distortion assessment and classification, along with a Bert score of 88.7%. Moreover, utilizing the TVLT model on our Multimodal Extended D-vlog Dataset yielded outstanding results, with an impressive F1-score of 67.8%

dataset, depression, vlog, (15 more...)

arXiv.org Artificial Intelligence

2406.10561

Country:

Europe > Iceland > Capital Region > Reykjavik (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

TVLT: Textless Vision-Language Transformer

Tang, Zineng, Cho, Jaemin, Nie, Yixin, Bansal, Mohit

arXiv.org Artificial IntelligenceNov-2-2022

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering, image retrieval, video retrieval, and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text. Our code and checkpoints are available at: https://github.com/zinengtang/TVLT

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2209.14156

Country:

North America > United States (0.14)
Europe > Serbia (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report > New Finding (0.86)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback