Automated Classification of Model Errors on ImageNet

Neural Information Processing Systems

While the ImageNet dataset has been driving computer vision research over the past decade, significant label noise and ambiguity have made top-1 accuracy an insufficient measure of further progress.


Unsupervised Adversarial Invariance

Ayush Jaiswal, Rex Yue Wu, Wael Abd-Almageed, Prem Natarajan

Neural Information Processing Systems

Data representations that contain all the information about target variables but are invariant to nuisance factors benefit supervised learning algorithms by preventing them from learning associations between these factors and the targets, thus reducing overfitting.


Two-Stream Network for Sign Language Recognition and Translation

Neural Information Processing Systems

We adopt identical data augmentations for RGB videos and heatmap sequences to maintain spatial and temporal consistency. SingleStream-SLT, which utilizes only a single video encoder without modelling keypoints, serves as our baseline. TwoStream-SLT-V/K/J denotes the variant in which a single translation network is attached to the video head, keypoint head, or joint head, respectively. The averaged probabilities are used to decode text sequences.
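The averaged-probability decoding described above can be sketched as follows. This is a minimal illustration only: it assumes `video_probs` and `keypoint_probs` are per-timestep output distributions from the two heads and uses greedy decoding, whereas the paper's actual decoder is more involved.

```python
import numpy as np

def joint_decode(video_probs: np.ndarray, keypoint_probs: np.ndarray) -> list[int]:
    """Average the per-step output distributions of the video and
    keypoint streams, then greedily pick the most likely token at
    each step. Both inputs have shape (timesteps, vocab_size)."""
    avg = (video_probs + keypoint_probs) / 2.0
    return avg.argmax(axis=-1).tolist()
```

In this sketch, fusing at the probability level (rather than at the feature level) means either stream can be dropped at inference time without retraining the other.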


Supplementary Materials for TVLT: Textless Vision-Language Transformer

Neural Information Processing Systems

Language Input          | HT100M | YTT-S
Audio                   | 75.3   | 76.8
Text (ASR-SpeechBrain)  | 76.5   | 76.6
Text (ASR-Google)       | 77.1   | 77.8
Text (GT Transcripts)   | 78.9   | 79.1

Table 2 shows the results of TVLT on CMU-MOSEI sentiment analysis with the following different inputs: audio, ASR-based text, and ground-truth text transcriptions. ASR-Google and ASR-SpeechBrain refer to the Google Cloud API and SpeechBrain, respectively (see main paper Sec.

Example transcripts with sentiment annotations:
"He is under house arrest and his mother takes away his Xboxes and TVs is sort of a little bit of additional punishment." 0.0 / -1.0 / 0.0 / 0.0
"And then last year we had 260 something come out to the dance" 1.0 / 2.0 / 2.0 / 1.0

We use the configurations as follows: (1) We set a single speech event to have a duration within [0.3s, 1.2s], so that an event is likely to cover a single word. If the silence gap is too large, it is usually a stop between two words. Specifically, we construct a 4-layer transformer language model that attends to TVLT encoder outputs via cross-attentions and jointly trains the encoder and decoder.
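The speech-event rule described above (events of 0.3–1.2 s, split wherever the silence gap is long) can be sketched as a simple frame-level segmenter. This is an illustrative reconstruction, not the paper's implementation; the frame duration and the `max_gap` threshold are assumed values.

```python
def segment_speech_events(frames, frame_dur=0.01, min_len=0.3, max_len=1.2, max_gap=0.15):
    """frames: booleans per frame (True = speech), each frame_dur seconds long.
    Contiguous speech frames are grouped into events; a silence gap of at
    least max_gap seconds closes the current event (a stop between words).
    Only events whose duration lies within [min_len, max_len] are kept."""
    events, start, last_speech = [], None, None
    for i, is_speech in enumerate(frames):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and (i - last_speech) * frame_dur >= max_gap:
            events.append((start, last_speech + 1))
            start = None
    if start is not None:
        events.append((start, last_speech + 1))
    # convert frame indices to seconds and apply the duration filter
    return [(s * frame_dur, e * frame_dur) for s, e in events
            if min_len <= (e - s) * frame_dur <= max_len]
```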



PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds

Neural Information Processing Systems

Pre-training is crucial in 3D-related fields such as autonomous driving where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Bird's-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Code will be available at https://github.com/PRED4pc/PRED.
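The point-wise masking mentioned in the abstract can be sketched as a uniform random drop of 95% of the points. This is only a plausible illustration of the masking step; PRED's actual masking granularity and sampling strategy may differ, and the function name is ours.

```python
import numpy as np

def mask_points(points: np.ndarray, mask_ratio: float = 0.95, seed: int = 0):
    """Randomly mask mask_ratio of the N input points (shape (N, 3) or
    (N, d)). Returns the kept subset plus the kept/masked index arrays,
    so a reconstruction or rendering target can refer to masked points."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    perm = rng.permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return points[keep_idx], keep_idx, mask_idx
```

A high mask ratio like 95% forces the encoder to infer scene structure from very sparse evidence, which is the usual motivation for aggressive masking in masked-point pre-training.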


A Categorical Analysis of Large Language Models and Why LLMs Circumvent the Symbol Grounding Problem

Floridi, Luciano, Jia, Yiyang, Tohmé, Fernando

arXiv.org Artificial Intelligence

This paper presents a formal, categorical framework for analysing how humans and large language models (LLMs) transform content into truth-evaluated propositions about a state space of possible worlds W , in order to argue that LLMs do not solve but circumvent the symbol grounding problem.


SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision

Hamidullah, Yasser, Yazdani, Shakib, Oguz, Cennet, van Genabith, Josef, España-Bonet, Cristina

arXiv.org Artificial Intelligence

Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
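One simple form the embedding supervision described above could take is a cosine-distance loss between the model's predicted sentence embedding and a frozen language-agnostic target embedding (e.g. from SONAR). This is a hedged sketch of one plausible objective, not the paper's confirmed loss function.

```python
import numpy as np

def embedding_supervision_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Cosine distance between the SLT model's predicted sentence
    embedding and a fixed multilingual target embedding. Returns 0 when
    the vectors point the same way, up to 2 when they are opposed."""
    cos = float(pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target)))
    return 1.0 - cos
```

Because the target space is shared across languages and modalities, the same supervised embedding can be decoded into any language the embedding model supports, which is what enables direct multilingual translation.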