akira
AKiRa: Augmentation Kit on Rays for optical video generation
Xi Wang, Robin Courant, Marc Christie, Vicky Kalogeiton
Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer users limited or no control over camera aspects, including dynamic camera motion, zoom, lens distortion, and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately producing visual content that draws focus, enhances mood, and guides emotions as filmmakers direct. In this paper, we aim to close the gap between controllable video generation and camera optics. To this end, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model on top of an existing video generation backbone. It enables fine-grained control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics, outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future optical video generation methods.
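To make the three optical parameters concrete, the sketch below shows one textbook way each can enter a camera model: focal length scales the per-pixel ray directions (zoom), a one-parameter division model bends rays toward the image edges (negative values give a barrel or fisheye-like look), and the thin-lens circle of confusion sets how strongly an out-of-focus point blurs for a given aperture (bokeh). The `CameraModel` class, its fields, and the choice of distortion model are assumptions for illustration only; this is not AKiRa's camera model or code.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraModel:
    """Illustrative camera: pinhole rays + division-model distortion + thin-lens blur.

    Hypothetical parameterization, not AKiRa's implementation.
    """
    focal_mm: float            # focal length f: larger f = narrower field of view (zoom in)
    distortion: float = 0.0    # division-model coefficient; negative = barrel / fisheye-like
    f_number: float = 2.8      # aperture N; aperture diameter A = f / N
    focus_mm: float = 3000.0   # distance to the in-focus plane
    sensor_mm: float = 36.0    # sensor width (square sensor assumed)
    size_px: int = 512         # image width = height in pixels

    def ray(self, u: float, v: float) -> tuple[float, float, float]:
        """Unit ray direction in camera coordinates for pixel (u, v)."""
        mm_per_px = self.sensor_mm / self.size_px
        # Distorted normalized image coordinates, relative to the image center.
        xd = (u + 0.5 - self.size_px / 2) * mm_per_px / self.focal_mm
        yd = (v + 0.5 - self.size_px / 2) * mm_per_px / self.focal_mm
        # Division model: undistorted = distorted / (1 + k * r_d^2).
        scale = 1.0 / (1.0 + self.distortion * (xd * xd + yd * yd))
        x, y = xd * scale, yd * scale
        norm = math.sqrt(x * x + y * y + 1.0)
        return (x / norm, y / norm, 1.0 / norm)

    def blur_diameter_mm(self, object_mm: float) -> float:
        """Thin-lens circle of confusion on the sensor for a point at object_mm."""
        f = self.focal_mm
        aperture = f / self.f_number
        return aperture * abs(object_mm - self.focus_mm) / object_mm * f / (self.focus_mm - f)


def edge_angle_deg(cam: CameraModel) -> float:
    """Angle between the optical axis and the ray through the right-edge pixel."""
    _, _, z = cam.ray(cam.size_px - 1, cam.size_px / 2 - 0.5)
    return math.degrees(math.acos(z))


wide = CameraModel(focal_mm=24.0)
tele = CameraModel(focal_mm=85.0)
fisheye = CameraModel(focal_mm=24.0, distortion=-0.3)
print(edge_angle_deg(tele), edge_angle_deg(wide), edge_angle_deg(fisheye))
```

Here the 85 mm camera's edge ray sits closer to the optical axis than the 24 mm one, and negative distortion pushes it further out; a lower `f_number` (wider aperture) increases the blur of any point off the focus plane. Very large negative `distortion` values can make the division-model denominator non-positive, so a practical model would clamp them.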
Akira's Machine Learning News -- Issue #38
In the following sections, I will introduce various articles and papers, covering not only the topics above but also the following five. Because interatomic distances vary with the types of atoms involved, the authors proposed multi-scale self-attention, which adjusts where attention is applied according to interatomic distance, and AFPS, which downsamples atoms according to their attention scores. The model performed well on quantum-chemistry molecular datasets.
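The two ideas in this summary can be sketched in a few lines. Below, each atom attends only to atoms within a given radius, the results are averaged over several radii (one possible reading of "multi-scale"), and the total attention each atom receives is then used to keep the top-k atoms. Identity query/key/value projections, averaging across scales, and the top-k rule are simplifying assumptions; in particular, `downsample_by_attention` is a hypothetical stand-in, not the paper's AFPS.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multiscale_attention(coords, feats, radii):
    """Average of distance-masked self-attention over several radii.

    Returns (outputs, attention_received). Identity query/key/value
    projections keep the sketch short; a real model would learn them.
    """
    n, d = len(feats), len(feats[0])
    outputs = [[0.0] * d for _ in range(n)]
    received = [0.0] * n
    for r in radii:
        for i in range(n):
            # Only atoms within radius r of atom i are attended to (i itself always is).
            nbrs = [j for j in range(n) if math.dist(coords[i], coords[j]) <= r]
            scores = [sum(a * b for a, b in zip(feats[i], feats[j])) / math.sqrt(d) for j in nbrs]
            for w, j in zip(softmax(scores), nbrs):
                received[j] += w / len(radii)
                for k in range(d):
                    outputs[i][k] += w * feats[j][k] / len(radii)
    return outputs, received


def downsample_by_attention(received, keep):
    """Indices of the `keep` atoms that received the most attention (simplified AFPS stand-in)."""
    order = sorted(range(len(received)), key=lambda j: received[j], reverse=True)
    return sorted(order[:keep])
```

With a small radius each atom sees only itself and its features pass through unchanged; larger radii mix in progressively more distant atoms.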
Akira's Machine Learning news -- #26
MERLOT: Multimodal Neural Script Knowledge Models. Trained on as many as 6 million videos and their accompanying subtitles, MERLOT learns through self-supervision on both temporal and spatial tasks. It uses no label information yet achieves state-of-the-art performance. Moreover, accuracy keeps improving as the pretraining data grows, even at 6 million videos, which makes this a promising research direction.
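One common way to obtain supervision from unlabeled video, and an example of the kind of temporal task mentioned here, is to shuffle a clip's segments and train the model to recover their original order; the label comes from the data itself. The sketch below only builds such training examples. The segment format (a frame identifier paired with its subtitle) and the function names are hypothetical, and this is not presented as MERLOT's exact objective.

```python
import random


def make_reordering_example(segments, rng):
    """Shuffle a clip's segments; the target is each shuffled segment's original index.

    The label is derived from the video itself, so no human annotation is needed.
    """
    positions = list(range(len(segments)))
    rng.shuffle(positions)
    shuffled = [segments[p] for p in positions]
    return shuffled, positions


def restore(shuffled, positions):
    """Undo the shuffle using the targets, i.e. what a perfect model would predict."""
    original = [None] * len(shuffled)
    for segment, pos in zip(shuffled, positions):
        original[pos] = segment
    return original


# Hypothetical clip: (frame id, subtitle) pairs in their true temporal order.
clip = [("f0", "crack the eggs"), ("f1", "whisk them"), ("f2", "pour into the pan"), ("f3", "serve")]
shuffled, positions = make_reordering_example(clip, random.Random(0))
assert restore(shuffled, positions) == clip
```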
Akira's Machine Learning news -- #21
Winning tickets in pre-training are transferable (arXiv): The results show that winning tickets, sparse subnetworks that can be trained in isolation to match the accuracy of the full network, exist regardless of whether pretraining is supervised or unsupervised.
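A winning ticket is usually found by training a network, pruning the smallest-magnitude weights, and resetting the surviving weights to their values at the chosen starting point. The sketch below performs one such round on a flat list of weights; the function names and the single-round, global pruning scheme are illustrative assumptions, not the cited paper's procedure. Transferring a ticket then means reusing the resulting mask and starting weights when training on a different task.

```python
def magnitude_mask(weights, sparsity):
    """Binary mask keeping the largest-|w| fraction (1 - sparsity) of the weights."""
    n_keep = round(len(weights) * (1 - sparsity))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(weights))]


def winning_ticket(start_weights, trained_weights, sparsity):
    """One round of magnitude pruning, rewinding survivors to their starting values."""
    mask = magnitude_mask(trained_weights, sparsity)
    ticket = [w0 * m for w0, m in zip(start_weights, mask)]
    return ticket, mask


start = [0.1, -0.2, 0.3, -0.4]     # weights at the chosen starting point
trained = [0.05, -0.9, 0.01, 0.7]  # the same weights after training
ticket, mask = winning_ticket(start, trained, sparsity=0.5)
print(mask, ticket)
```

Pruning is decided by the trained magnitudes, but the kept weights take their starting values, which is what distinguishes a lottery ticket from ordinary post-training pruning.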
Classifying old Japanese characters using CNN
Jiro's pick this week is CNN for Old Japanese Character Classification by my colleague Akira Agata. Nowadays, I probably go days at a time without seeing a handwritten document. On computers, smartphones, TVs, and in books, almost every character I see is printed, so it's refreshing to see a handwritten document from time to time. This demo by Akira uses deep learning (convolutional neural networks) to classify various handwritten Japanese characters.