Two-Stream Network for Sign Language Recognition and Translation
We adopt identical data augmentations for RGB videos and heatmap sequences to maintain spatial and temporal consistency. SingleStream-SLT, which utilizes only a single video encoder without modelling keypoints, serves as our baseline. TwoStream-SLT-V/K/J denotes the variant in which a single translation network is attached to the video head, keypoint head, or joint head, respectively. The averaged probabilities are used to decode text sequences.
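Decoding from averaged probabilities can be sketched as follows. This is a minimal greedy illustration, not the system's actual decoder (which likely uses beam search); the function name, the toy vocabulary, and the EOS convention are assumptions for the example.

```python
import numpy as np

def greedy_decode_averaged(step_probs_per_head, id2token, eos_id=2):
    """Greedily decode a token sequence from per-step vocabulary
    distributions averaged over several heads (e.g. video, keypoint,
    joint). step_probs_per_head: list of (T, V) probability arrays."""
    # Average across heads: (H, T, V) -> (T, V).
    avg = np.mean(np.stack(step_probs_per_head), axis=0)
    tokens = []
    for dist in avg:
        tok = int(dist.argmax())
        if tok == eos_id:  # stop at end-of-sequence
            break
        tokens.append(id2token[tok])
    return " ".join(tokens)
```

Averaging the distributions before the argmax lets a confident head outvote an uncertain one at each step, which is the usual motivation for this kind of late fusion.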
Supplementary Materials for TVLT: Textless Vision-Language Transformer
                         CMU-MOSEI (A2)
Language Input           HT100M    YTT-S
Audio                    75.3      76.8
Text (ASR-SpeechBrain)   76.5      76.6
Text (ASR-Google)        77.1      77.8
Text (GT Transcripts)    78.9      79.1

Table 2 shows the results of TVLT on CMU-MOSEI sentiment analysis with the following different inputs: audio, ASR-based text, and ground-truth text transcriptions. ASR-Google and ASR-SpeechBrain refer to the Google Cloud API and SpeechBrain, respectively (see main paper Sec.).

Qualitative examples with sentiment annotations:
- "He is under house arrest and his mother takes away his Xboxes and TVs is sort of a little bit of additional punishment." (0.0, -1.0, 0.0, 0.0)
- "And then last year we had 260 something come out to the dance." (1.0, 2.0, 2.0, 1.0)

We use the following configurations: (1) We set a single speech event to have a duration within [0.3s, 1.2s], so that an event is likely to cover a single word. If the silence gap is too large, it is usually a stop between two words. Specifically, we construct a 4-layer transformer language model that attends to TVLT encoder outputs via cross-attention and jointly train the encoder and decoder.
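The speech-event segmentation in configuration (1) can be sketched as grouping frame-level voice-activity flags into events, merging runs separated by short silences and enforcing the [0.3s, 1.2s] duration window. The function name, the frame length, and the `max_gap` silence threshold are assumptions; only the duration bounds come from the text.

```python
def segment_speech_events(voiced, frame_sec=0.02,
                          min_dur=0.3, max_dur=1.2, max_gap=0.15):
    """Group per-frame voiced flags into speech events, returned as
    (start_sec, end_sec) pairs."""
    # Collect contiguous voiced runs as (start, end) frame indices.
    runs, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(voiced)))
    # Merge runs separated by a short silence; a large silence gap is
    # treated as a stop between two words and keeps the events apart.
    merged = []
    for s, e in runs:
        if merged and (s - merged[-1][1]) * frame_sec <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # Enforce the duration window: split overly long events into
    # max_dur chunks, drop remnants shorter than min_dur.
    events = []
    for s, e in merged:
        while (e - s) * frame_sec > max_dur:
            cut = s + int(max_dur / frame_sec)
            events.append((s * frame_sec, cut * frame_sec))
            s = cut
        if (e - s) * frame_sec >= min_dur:
            events.append((s * frame_sec, e * frame_sec))
    return events
```

With these bounds an event tends to cover a single word, which matches the stated intent of the configuration.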
PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds
Pre-training is crucial in 3D-related fields such as autonomous driving, where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Bird's-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Code will be available at https://github.com/PRED4pc/PRED.
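Point-wise masking at a 95% ratio amounts to feeding the encoder only a small random subset of points during pre-training. A minimal sketch under that reading; `mask_points` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

def mask_points(points, mask_ratio=0.95, rng=None):
    """Randomly keep (1 - mask_ratio) of the points.
    points: (N, C) array of LiDAR points (xyz + features).
    Returns the visible points and their original indices."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = points.shape[0]
    keep = max(1, int(round(n * (1.0 - mask_ratio))))
    idx = rng.permutation(n)[:keep]
    return points[idx], idx
```

A high mask ratio makes the reconstruction/rendering objective harder and forces the encoder to infer scene structure from sparse evidence, the usual rationale for masked pre-training.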
SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Hamidullah, Yasser, Yazdani, Shakib, Oguz, Cennet, van Genabith, Josef, España-Bonet, Cristina
Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
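At its core, language-agnostic embedding supervision matches a predicted sign-video embedding against a target multilingual sentence embedding (e.g. from SONAR). A minimal sketch using cosine distance; the function name is hypothetical and the choice of cosine distance is an assumption, as the exact loss is not specified here.

```python
import numpy as np

def embedding_supervision_loss(pred, target):
    """Cosine distance between a predicted video embedding and a
    language-agnostic target sentence embedding. 0 when the vectors
    point the same way, up to 2 when they are opposed."""
    pred = pred / np.linalg.norm(pred)
    target = target / np.linalg.norm(target)
    return 1.0 - float(pred @ target)
```

Because the target space is shared across languages and modalities, the same loss can supervise translation into many languages at once, which is what enables the multilingual setup described above.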