NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results
Li, Xin, Yuan, Kun, Pei, Yajing, Lu, Yiting, Sun, Ming, Zhou, Chao, Chen, Zhibo, Timofte, Radu, Sun, Wei, Wu, Haoning, Zhang, Zicheng, Jia, Jun, Zhang, Zhichao, Cao, Linhan, Chen, Qiubo, Min, Xiongkuo, Lin, Weisi, Zhai, Guangtao, Sun, Jianhui, Wang, Tianyi, Li, Lei, Kong, Han, Wang, Wenxuan, Li, Bing, Luo, Cheng, Wang, Haiqiang, Chen, Xiangguang, Meng, Wenhui, Pan, Xiang, Shi, Huiying, Zhu, Han, Xu, Xiaozhong, Sun, Lei, Chen, Zhenzhong, Liu, Shan, Kong, Fangyuan, Fan, Haotian, Xu, Yifang, Xu, Haoran, Yang, Mengduo, Zhou, Jie, Li, Jiaze, Wen, Shijie, Xu, Mai, Li, Da, Yao, Shunyu, Du, Jiazhi, Zuo, Wangmeng, Li, Zhibo, He, Shuai, Ming, Anlong, Fu, Huiyuan, Ma, Huadong, Wu, Yong, Xue, Fie, Zhao, Guozhi, Du, Lina, Guo, Jie, Zhang, Yu, Zheng, Huimin, Chen, Junhao, Liu, Yue, Zhou, Dulan, Xu, Kele, Xu, Qisheng, Sun, Tao, Ding, Zhixiang, Hu, Yuhang
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), in which the submitted solutions were evaluated on KVQ, a dataset collected from the popular short-form video platform Kuaishou/Kwai. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition attracted 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
Event-Based Eye Tracking. AIS 2024 Challenge Survey
Wang, Zuowen, Gao, Chang, Wu, Zongwei, Conde, Marcos V., Timofte, Radu, Liu, Shih-Chii, Chen, Qinyu, Zha, Zheng-jun, Zhai, Wei, Han, Han, Liao, Bohao, Wu, Yuliang, Wan, Zengyu, Wang, Zhong, Cao, Yang, Tan, Ganchao, Chen, Jinze, Pei, Yan Ru, Brüers, Sasskia, Crouzet, Sébastien, McLelland, Douglas, Coenen, Oliver, Zhang, Baoheng, Gao, Yizhao, Li, Jingyuan, So, Hayden Kwok-Hay, Bich, Philippe, Boretti, Chiara, Prono, Luciano, Lică, Mircea, Dinucu-Jianu, David, Grîu, Cătălin, Lin, Xiaopeng, Ren, Hongwei, Cheng, Bojun, Zhang, Xinan, Vial, Valentin, Yezzi, Anthony, Tsai, James
This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The challenge task focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve a good trade-off between task accuracy and efficiency. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.
Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation?
Ignatov, Dmitry, Ignatov, Andrey, Timofte, Radu
We present ANYU, a new virtually augmented version of the NYU Depth V2 dataset designed for monocular depth estimation. In contrast to the well-known approach in which full 3D scenes of a virtual world are used to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual-reality objects into the original NYU Depth V2 images. We deliberately did not match each generated virtual object with an appropriate texture and a suitable location within the real-world image. Instead, the assignment of texture, location, lighting, and other rendering parameters was randomized to maximize the diversity of the training data and to show that it is this randomness that can improve the generalization ability of a dataset. By conducting extensive experiments with our virtually modified dataset and validating on the original NYU Depth V2 and iBims-1 benchmarks, we show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures, especially for the current state-of-the-art VPD model. To the best of our knowledge, this is the first work that augments a real-world dataset with randomly generated virtual 3D objects for monocular depth estimation. We make our ANYU dataset publicly available in two training configurations, with 10% and 100% additional synthetically enriched RGB-D training pairs, for efficient training and empirical exploration of virtual augmentation, at https://github.com/ABrain-One/ANYU.
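As a rough illustration of the randomized-assignment idea described above, the sketch below samples texture, location, depth, scale, and lighting for a virtual object independently of the real scene. The parameter names, value ranges, and the sampler itself are hypothetical and not taken from the ANYU pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class VirtualObjectParams:
    """Hypothetical rendering parameters for one inserted virtual object."""
    texture_id: int          # index into an assumed texture bank
    position_xy: tuple       # normalized image-plane position
    depth_m: float           # placement depth in meters
    scale: float             # object scale factor
    light_intensity: float   # relative light-source intensity

def sample_params(num_textures=500, seed=None):
    """Sample fully randomized parameters, mirroring the idea that texture,
    location, and lighting are NOT matched to the real-world image."""
    rng = random.Random(seed)
    return VirtualObjectParams(
        texture_id=rng.randrange(num_textures),
        position_xy=(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)),
        depth_m=rng.uniform(0.5, 10.0),   # rough indoor depth range, illustrative only
        scale=rng.uniform(0.1, 1.0),
        light_intensity=rng.uniform(0.2, 2.0),
    )

if __name__ == "__main__":
    print(sample_params(seed=0))
```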
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
Burchi, Maxime, Puvvada, Krishna C., Balam, Jagadeesh, Ginsburg, Boris, Timofte, Radu
Humans are adept at leveraging visual cues from lip movements to recognize speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages by generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute average-WER reduction of 11.9% compared to the original baseline. Finally, we demonstrate the ability of the proposed model to perform audio-only, visual-only, and audio-visual speech recognition at test time.
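A hybrid CTC/RNN-T objective is commonly implemented as a weighted sum of the two losses. The minimal sketch below assumes torchaudio's RNNTLoss is available and uses an illustrative 0.3/0.7 weighting; it is a generic hybrid-loss sketch, not the paper's exact training configuration.

```python
import torch
import torchaudio

class HybridCTCRNNTLoss(torch.nn.Module):
    """Weighted combination of a CTC loss (on encoder outputs) and an
    RNN-T loss (on joint-network outputs)."""

    def __init__(self, blank=0, ctc_weight=0.3):
        super().__init__()
        self.ctc = torch.nn.CTCLoss(blank=blank, zero_infinity=True)
        self.rnnt = torchaudio.transforms.RNNTLoss(blank=blank)
        self.ctc_weight = ctc_weight

    def forward(self, ctc_log_probs, rnnt_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, N, C) log-probabilities from the CTC head
        # rnnt_logits:   (N, T, U+1, C) logits from the joint network
        # targets:       (N, U) padded label sequences
        ctc_loss = self.ctc(ctc_log_probs, targets, input_lengths, target_lengths)
        rnnt_loss = self.rnnt(rnnt_logits, targets.int(),
                              input_lengths.int(), target_lengths.int())
        return self.ctc_weight * ctc_loss + (1.0 - self.ctc_weight) * rnnt_loss
```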
High-Quality Image Restoration Following Human Instructions
Conde, Marcos V., Geigle, Gregor, Timofte, Radu
Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-in-one image restoration models can effectively restore images from various types and levels of degradation, using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions to guide the image restoration model. Given natural language prompts, our model can recover high-quality images from their degraded counterparts, considering multiple degradation types. Our method, InstructIR, achieves state-of-the-art results on several restoration tasks, including image denoising, deraining, deblurring, dehazing, and (low-light) image enhancement. InstructIR improves by +1 dB over previous all-in-one restoration methods. Moreover, our dataset and results represent a novel benchmark for new research on text-guided image restoration and enhancement. Our code, datasets, and models are available at https://github.com/mv-lab/InstructIR.
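To make the idea of instruction-guided restoration concrete, here is a generic FiLM-style conditioning sketch in which a text-instruction embedding produces per-channel scale and shift factors that modulate image features. This is an assumed, simplified conditioning scheme for illustration, not InstructIR's actual architecture.

```python
import torch
import torch.nn as nn

class InstructionConditioning(nn.Module):
    """Generic conditioning block: a sentence embedding of the user's
    instruction modulates restoration features channel-wise."""

    def __init__(self, text_dim=384, feat_channels=64):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, 2 * feat_channels)

    def forward(self, feats, text_emb):
        # feats: (N, C, H, W) image features; text_emb: (N, text_dim)
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feats * (1.0 + scale) + shift

# Usage sketch: embed an instruction such as "remove the rain from this photo"
# with any sentence encoder, then apply the block inside the restoration network.
```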
A Study of Forward-Forward Algorithm for Self-Supervised Learning
Brenig, Jonas, Timofte, Radu
Self-supervised representation learning has seen remarkable progress in the last few years, with some recent methods able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It uses two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, we compare for the first time the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that, while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance lags significantly behind in all studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way supervised training is realized in the forward-forward paradigm. Compared to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information unnecessary for making decisions, which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning and to make it work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
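For readers unfamiliar with the forward-forward algorithm, the sketch below shows its per-layer "goodness" objective: each layer is trained in isolation to push the goodness of positive samples above a threshold and that of negative samples below it, with no gradient flow between layers. This is a minimal textbook-style sketch of Hinton's algorithm, not the code used in the study.

```python
import torch
import torch.nn as nn

class FFLayer(nn.Module):
    """One fully connected layer trained with the forward-forward goodness
    objective. Each layer owns its optimizer; layers never backpropagate
    into each other."""

    def __init__(self, in_dim, out_dim, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Length-normalize so the previous layer's goodness cannot be passed on.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).mean(dim=1)  # goodness of positives
        g_neg = self.forward(x_neg).pow(2).mean(dim=1)  # goodness of negatives
        # Softplus loss: positives above the threshold, negatives below it.
        loss = torch.log1p(torch.exp(torch.cat([
            self.threshold - g_pos,
            g_neg - self.threshold,
        ]))).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach outputs so the next layer trains on them without backprop.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```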
Event-Free Moving Object Segmentation from Moving Ego Vehicle
Zhou, Zhuyun, Wu, Zongwei, Paudel, Danda Pani, Boutteau, Rémi, Yang, Fan, Van Gool, Luc, Timofte, Radu, Ginhac, Dominique
Moving object segmentation (MOS) in dynamic scenes is challenging for autonomous driving, especially for sequences obtained from moving ego vehicles. Most state-of-the-art methods leverage motion cues obtained from optical flow maps. However, since these methods are often based on optical flow pre-computed from successive RGB frames, they neglect the temporal dynamics between frames, which limits their practicality in real-life situations. To address these limitations, we propose to exploit event cameras for better video understanding, as they provide rich motion cues without relying on optical flow. To foster research in this area, we first introduce a novel large-scale dataset called DSEC-MOS for moving object segmentation from moving ego vehicles. Subsequently, we devise EmoFormer, a novel network able to exploit the event data. For this purpose, we fuse the event prior with spatial semantic maps to distinguish moving objects from the static background, adding another level of dense supervision around our objects of interest, i.e., the moving ones. Our proposed network relies only on event data for training and does not require event input during inference, making it directly comparable to frame-only methods in terms of efficiency and more widely usable in many application cases. An exhaustive comparison with 8 state-of-the-art video object segmentation methods highlights a significant performance improvement of our method over all others. Project page: https://github.com/ZZY-Zhou/DSEC-MOS.
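As a toy illustration of fusing an event prior with spatial semantic features, the block below embeds an accumulated event-density map and concatenates it with frame features before a small convolutional refinement. It is a generic fusion sketch under assumed tensor shapes, not EmoFormer's actual module.

```python
import torch
import torch.nn as nn

class EventPriorFusion(nn.Module):
    """Toy fusion of an event prior with RGB semantic features."""

    def __init__(self, feat_channels=64, event_channels=1):
        super().__init__()
        self.event_embed = nn.Conv2d(event_channels, feat_channels, 3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1),
            nn.BatchNorm2d(feat_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, frame_feats, event_map):
        # frame_feats: (N, C, H, W) semantic features from the RGB frame
        # event_map:   (N, 1, H, W) accumulated event density (motion prior)
        e = self.event_embed(event_map)
        return self.fuse(torch.cat([frame_feats, e], dim=1))
```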
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Geigle, Gregor, Jain, Abhay, Timofte, Radu, Glavaš, Goran
Modular vision-language models (Vision-LLMs) align pretrained image encoders with frozen large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most researchers and practitioners. Vision-LLMs instead post-hoc condition LLMs to 'understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware and using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM. For this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on orders of magnitude less data. We release our model and code at https://github.com/gregor-ge/mBLIP.
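The re-alignment idea can be pictured as training only a lightweight projection from frozen visual features into the embedding space of a frozen multilingual LLM, supervised by machine-translated captions. The sketch below assumes a Hugging Face-style LLM interface (`inputs_embeds`/`labels`) and omits mBLIP's actual components such as its Q-Former and any parameter-efficient LLM tuning; all names are illustrative.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Lightweight projection from frozen image-encoder features into the
    embedding space of a frozen multilingual LLM (illustrative only)."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats):
        # vision_feats: (N, T_v, vision_dim) -> (N, T_v, llm_dim)
        return self.proj(vision_feats)

def realignment_step(projector, llm, vision_feats, caption_embeds, caption_labels):
    """One training step: the LLM is frozen (requires_grad=False elsewhere);
    only the projector receives gradients."""
    visual_tokens = projector(vision_feats)                  # (N, T_v, llm_dim)
    inputs = torch.cat([visual_tokens, caption_embeds], 1)   # prepend visual tokens
    ignore = torch.full(visual_tokens.shape[:2], -100,       # no LM loss on them
                        dtype=caption_labels.dtype, device=caption_labels.device)
    labels = torch.cat([ignore, caption_labels], 1)
    return llm(inputs_embeds=inputs, labels=labels).loss
```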
3D-Aware Video Generation
Bahmani, Sherwin, Park, Jeong Joon, Paschalidou, Despoina, Tang, Hao, Wetzstein, Gordon, Guibas, Leonidas, Van Gool, Luc, Timofte, Radu
Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled high-quality 3D or video content to be generated that exhibits either multi-view or temporal consistency. In this work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D videos supervised only with monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings, while producing imagery with quality comparable to that of existing 3D or video GANs.
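One way to picture a time-aware discriminator is a critic that scores a pair of rendered frames conditioned on their time offset, which penalizes temporally inconsistent generations. The sketch below is a generic illustration of that idea under assumed shapes, not the discriminator used in the paper.

```python
import torch
import torch.nn as nn

class TimeAwareDiscriminator(nn.Module):
    """Toy discriminator scoring a frame pair conditioned on its time gap."""

    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(2 * in_channels, base, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, 2 * base, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.time_embed = nn.Linear(1, 2 * base)
        self.head = nn.Linear(2 * base, 1)

    def forward(self, frame_a, frame_b, dt):
        # frame_a, frame_b: (N, 3, H, W); dt: (N, 1) normalized time offset
        x = self.backbone(torch.cat([frame_a, frame_b], dim=1))
        x = x + self.time_embed(dt)   # condition on the temporal gap
        return self.head(x)           # real/fake logit
```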
Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations
Geigle, Gregor, Timofte, Radu, Glavaš, Goran
Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. The bulk of the evaluation of these models is, however, performed with English text only: the costly creation of language-specific image-caption datasets has limited multilingual VL benchmarks to a handful of high-resource languages. In this work, we introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of 1000 ImageNet labels into 92 languages, built without resorting to machine translation (MT) or requiring manual annotation. We instead automatically obtain reliable translations of ImageNet concepts by linking them -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network. We evaluate 8 different publicly available multilingual CLIP models on zero-shot image classification (ZS-IC) for each of the 92 Babel-ImageNet languages, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance on Babel-ImageNet correlates highly with their performance in image-text retrieval, validating that Babel-ImageNet is suitable for estimating the quality of multilingual VL representation spaces for the vast majority of languages that lack gold image-text data. Finally, we show that the performance of multilingual CLIP for low-resource languages can be drastically improved via cheap, parameter-efficient language-specific training. We make our code and data publicly available at https://github.com/gregor-ge/Babel-ImageNet.
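Zero-shot image classification with a CLIP-style dual encoder reduces to comparing image embeddings against text embeddings of translated class prompts. The sketch below assumes embeddings have already been produced by some multilingual CLIP model; the function and prompt wording are illustrative, not the benchmark's evaluation code.

```python
import torch

def zero_shot_classify(image_embs, class_text_embs):
    """Zero-shot classification from precomputed embeddings.
    image_embs: (N, D) image embeddings; class_text_embs: (K, D) embeddings of
    prompts containing the translated class names in the target language."""
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    class_text_embs = torch.nn.functional.normalize(class_text_embs, dim=-1)
    logits = image_embs @ class_text_embs.T   # cosine similarities, (N, K)
    return logits.argmax(dim=-1)              # predicted class index per image

# Usage sketch: encode e.g. German prompts ("ein Foto von einem Hund", ...)
# with the multilingual text tower, encode images with the vision tower,
# then call zero_shot_classify(image_embs, class_text_embs).
```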