AITopics | Zhang, David Junhao

Collaborating Authors

Zhang, David Junhao

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Zhang, David Junhao, Paiss, Roni, Zada, Shiran, Karnad, Nikhil, Jacobs, David E., Pritch, Yael, Mosseri, Inbar, Shou, Mike Zheng, Wadhwa, Neal, Ruiz, Nataniel

arXiv.org Artificial IntelligenceNov-7-2024

Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.

artificial intelligence, machine learning, video, (15 more...)

arXiv.org Artificial Intelligence

2411.05003

Country: Asia (0.14)

Genre: Research Report (0.50)

Industry:

Media > Television (1.00)
Media > Photography (1.00)
Media > Film (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Ni, Jinjie, Song, Yifan, Ghosal, Deepanway, Li, Bo, Zhang, David Junhao, Yue, Xiang, Xue, Fuzhao, Zheng, Zian, Zhang, Kaichen, Shah, Mahir, Jain, Kabir, You, Yang, Shieh, Michael

arXiv.org Artificial IntelligenceOct-18-2024

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with that of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.13754

Country:

Asia (0.14)
North America > United States (0.14)

Genre: Research Report (0.81)

Industry:

Media (0.92)
Leisure & Entertainment > Sports (0.67)
Education (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

Making Vision Transformers Efficient from A Token Sparsification View

Chang, Shuning, Wang, Pichao, Lin, Ming, Wang, Fan, Zhang, David Junhao, Jin, Rong, Shou, Mike Zheng

arXiv.org Artificial IntelligenceMar-30-2023

The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design a STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, which is powerless for previous token sparsification methods. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone. Code is available at http://github.com/changsn/STViT-R

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2303.08685

Country: Asia > Middle East > Israel (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.36)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.34)

Add feedback