AITopics | Ren, Jian

Collaborating Authors

Ren, Jian

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Wu, Yushu, Zhang, Zhixing, Li, Yanyu, Xu, Yanwu, Kag, Anil, Sui, Yang, Coskun, Huseyin, Ma, Ke, Lebedev, Aleksei, Hu, Ju, Metaxas, Dimitris, Wang, Yanzhi, Tulyakov, Sergey, Ren, Jian

arXiv.org Artificial IntelligenceDec-13-2024

We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by magnitudes while delivering on-par quality.

artificial intelligence, diffusion model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2412.10494

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)

Add feedback

AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

Kag, Anil, Coskun, Huseyin, Chen, Jierun, Cao, Junli, Menapace, Willi, Siarohin, Aliaksandr, Tulyakov, Sergey, Ren, Jian

arXiv.org Artificial IntelligenceNov-7-2024

Neural network architecture design requires making many crucial decisions. The common desiderata is that similar decisions, with little modifications, can be reused in a variety of tasks and applications. To satisfy that, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective \emph{asymmetric} architecture, where the distribution of convolutional and transformer blocks is \emph{asymmetric}, containing more convolutional blocks in the earlier stages, followed by more transformer blocks in later stages. AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation, and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2411.04967

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Efficient Training with Denoised Neural Weights

Gong, Yifan, Zhan, Zheng, Li, Yanyu, Idelbayev, Yerlan, Zharkov, Andrey, Aberman, Kfir, Tulyakov, Sergey, Wang, Yanzhi, Ren, Jian

arXiv.org Artificial IntelligenceJul-16-2024

Good weight initialization serves as an effective measure to reduce the training cost of a deep neural network (DNN) model. The choice of how to initialize parameters is challenging and may require manual tuning, which can be time-consuming and prone to human error. To overcome such limitations, this work takes a novel step towards building a weight generator to synthesize the neural weights for initialization. We use the image-to-image translation task with generative adversarial networks (GANs) as an example due to the ease of collecting model weights spanning a wide range. Specifically, we first collect a dataset with various image editing concepts and their corresponding trained weights, which are later used for the training of the weight generator. To address the different characteristics among layers and the substantial number of weights to be predicted, we divide the weights into equal-sized blocks and assign each block an index. Subsequently, a diffusion model is trained with such a dataset using both text conditions of the concept and the block indexes. By initializing the image translation model with the denoised weights predicted by our diffusion model, the training requires only 43.3 seconds. Compared to training from scratch (i.e., Pix2pix), we achieve a 15x training time acceleration for a new concept while obtaining even better image generation quality.

artificial intelligence, machine learning, weight generator, (13 more...)

arXiv.org Artificial Intelligence

2407.11966

Genre: Research Report (1.00)

Industry: Media (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

TextCraftor: Your Text Encoder Can be Image Quality Controller

Li, Yanyu, Liu, Xian, Kag, Anil, Hu, Ju, Idelbayev, Yerlan, Sagar, Dhritiman, Wang, Yanzhi, Tulyakov, Sergey, Ren, Jian

arXiv.org Artificial IntelligenceMar-27-2024

Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning, and can be combined to further improve generative quality.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2403.18978

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Menapace, Willi, Siarohin, Aliaksandr, Skorokhodov, Ivan, Deyneka, Ekaterina, Chen, Tsai-Shien, Kag, Anil, Fang, Yuwei, Stoliar, Aleksei, Ricci, Elisa, Ren, Jian, Tulyakov, Sergey

arXiv.org Artificial IntelligenceFeb-22-2024

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2402.14797

Country: Europe > Italy (0.28)

Genre: Research Report (0.83)

Industry:

Media (1.00)
Leisure & Entertainment > Sports > Track & Field (0.67)
Automobiles & Trucks (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation

Gong, Yifan, Zhan, Zheng, Jin, Qing, Li, Yanyu, Idelbayev, Yerlan, Liu, Xian, Zharkov, Andrey, Aberman, Kfir, Tulyakov, Sergey, Wang, Yanzhi, Ren, Jian

arXiv.org Artificial IntelligenceJan-11-2024

One highly promising direction for enabling flexible real-time on-device image editing is utilizing data distillation by leveraging large-scale text-to-image diffusion models, such as Stable Diffusion, to generate paired datasets used for training generative adversarial networks (GANs). This approach notably alleviates the stringent requirements typically imposed by high-end commercial GPUs for performing image editing with diffusion models. However, unlike text-to-image diffusion models, each distilled GAN is specialized for a specific image editing task, necessitating costly training efforts to obtain models for various concepts. In this work, we introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient? To achieve this goal, we propose a series of innovative techniques. First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch. Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model. Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time. Extensive experiments show that we can efficiently empower GANs with the ability to perform real-time high-quality image editing on mobile devices with remarkable reduced training cost and storage for each concept.

artificial intelligence, diffusion model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2401.06127

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Li, Yanyu, Wang, Huan, Jin, Qing, Hu, Ju, Chemerys, Pavlo, Fu, Yun, Wang, Yanzhi, Tulyakov, Sergey, Ren, Jian

arXiv.org Artificial IntelligenceOct-16-2023

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve so by introducing efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

artificial intelligence, distillation, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2306.0098

Country:

North America > United States (0.28)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry:

Information Technology > Security & Privacy (0.66)
Media > Photography (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Rethinking Vision Transformers for MobileNet Size and Speed

Li, Yanyu, Hu, Ju, Wen, Yang, Evangelidis, Georgios, Salahi, Kamyar, Wang, Yanzhi, Tulyakov, Sergey, Ren, Jian

arXiv.org Artificial IntelligenceSep-4-2023

With the success of Vision Transformers (ViTs) in computer vision tasks, recent arts try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even true for the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constraint hardware. In this work, we investigate a central question, can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose a novel supernet with low latency and high parameter efficiency. We further introduce a novel fine-grained joint search strategy for transformer models that can find efficient architectures by optimizing latency and number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve 3.5% higher top-1 accuracy than MobileNetV2 on ImageNet-1K with similar latency and parameters. This work demonstrate that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.

artificial intelligence, machine learning, transformer, (16 more...)

arXiv.org Artificial Intelligence

2212.08059

Country: Asia > Middle East > Israel (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

CLGT: A Graph Transformer for Student Performance Prediction in Collaborative Learning

Peng, Tianhao, Liang, Yu, Wu, Wenjun, Ren, Jian, Pengrui, Zhao, Pu, Yanjun

arXiv.org Artificial IntelligenceJul-30-2023

Modeling and predicting the performance of students in collaborative learning paradigms is an important task. Most of the research presented in literature regarding collaborative learning focuses on the discussion forums and social learning networks. There are only a few works that investigate how students interact with each other in team projects and how such interactions affect their academic performance. In order to bridge this gap, we choose a software engineering course as the study subject. The students who participate in a software engineering course are required to team up and complete a software project together. In this work, we construct an interaction graph based on the activities of students grouped in various teams. Based on this student interaction graph, we present an extended graph transformer framework for collaborative learning (CLGT) for evaluating and predicting the performance of students. Moreover, the proposed CLGT contains an interpretation module that explains the prediction results and visualizes the student interaction patterns. The experimental results confirm that the proposed CLGT outperforms the baseline models in terms of performing predictions based on the real-world datasets. Moreover, the proposed CLGT differentiates the students with poor performance in the collaborative learning paradigm and gives teachers early warnings, so that appropriate assistance can be provided.

artificial intelligence, machine learning, student, (16 more...)

arXiv.org Artificial Intelligence

2308.02038

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.93)

Industry:

Education > Curriculum > Subject-Specific Education (0.55)
Education > Assessment & Standards > Student Performance (0.40)

Technology:

Information Technology > Communications > Collaboration (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
(2 more...)

Add feedback

Real-Time Neural Light Field on Mobile Devices

Cao, Junli, Wang, Huan, Chemerys, Pavlo, Shakhrai, Vladislav, Hu, Ju, Fu, Yun, Makoviichuk, Denys, Tulyakov, Sergey, Ren, Jian

arXiv.org Artificial IntelligenceJun-24-2023

Recent efforts in Neural Rendering Fields (NeRF) have shown impressive results on novel view synthesis by utilizing implicit neural representation to represent 3D scenes. Due to the process of volumetric rendering, the inference speed for NeRF is extremely slow, limiting the application scenarios of utilizing NeRF on resource-constrained hardware, such as mobile devices. Many works have been conducted to reduce the latency of running NeRF models. However, most of them still require high-end GPU for acceleration or extra storage memory, which is all unavailable on mobile devices. Another emerging direction utilizes the neural light field (NeLF) for speedup, as only one forward pass is performed on a ray to predict the pixel color. Nevertheless, to reach a similar rendering quality as NeRF, the network in NeLF is designed with intensive computation, which is not mobile-friendly. In this work, we propose an efficient network that runs in real-time on mobile devices for neural rendering. We follow the setting of NeLF to train our network. Unlike existing works, we introduce a novel network architecture that runs efficiently on mobile devices with low latency and small size, i.e., saving $15\times \sim 24\times$ storage compared with MobileNeRF. Our model achieves high-resolution generation while maintaining real-time inference for both synthetic and real-world scenes on mobile devices, e.g., $18.04$ms (iPhone 13) for rendering one $1008\times756$ image of real 3D scenes. Additionally, we achieve similar image quality as NeRF and better quality than MobileNeRF (PSNR $26.15$ vs. $25.91$ on the real-world forward-facing dataset).

artificial intelligence, machine learning, mobiler2l, (19 more...)

arXiv.org Artificial Intelligence

2212.08057

Country: North America > United States > Oklahoma > Beaver County (0.61)

Genre: Research Report (0.82)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback