
Collaborating Authors

 tang


How Chinese short dramas became AI content machines

MIT Technology Review

The viral short dramas are increasingly being created entirely with AI, with hundreds of new shows spun up each day. In a dimly lit bedroom, a frightened young woman is thrown onto a bed by a tall, muscular man. He grabs her hand, and flame-like vines crawl across her body, fusing with her flesh. A dragon-shaped tattoo appears across her chest. "Two months," the man says. "Give me an heir, or I will eat you."



Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting

Neural Information Processing Systems

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models (LLMs) while maintaining an identical sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly and impractical. In this paper, we propose Kangaroo, a novel self-speculative decoding framework with a double early-exiting strategy, which leverages the shallow sub-network and the LM Head of the well-trained target LLM to construct a self-drafting model. The self-verification stage then only requires computing the remaining layers over the early-exited hidden states in parallel. To bridge the representation gap between the sub-network and the full model, we train a lightweight and efficient adapter module on top of the sub-network.
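The draft-then-verify loop described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the functions `shallow`, `deep`, `lm_head`, and `adapter` are hypothetical integer-arithmetic stand-ins for the shallow sub-network, the remaining layers, the shared LM Head, and the trained adapter. The confidence threshold that stops drafting is also an assumption, used here as one plausible reading of the second early exit. Decoding is greedy, and verification runs sequentially here although the abstract describes it as parallel.

```python
VOCAB = 11  # toy vocabulary size (assumption)

def shallow(tokens):
    """Hypothetical shallow sub-network: cheap hidden state for a token prefix."""
    return sum((i + 1) * t for i, t in enumerate(tokens)) % 97

def deep(hidden):
    """Hypothetical remaining layers, applied to an early-exited hidden state."""
    return (hidden * 7 + 3) % 97

def lm_head(hidden):
    """Shared LM head: maps a hidden state to a token id."""
    return hidden % VOCAB

def adapter(hidden):
    """Hypothetical adapter: approximates deep() and reports a confidence."""
    approx = deep(hidden) if hidden % 3 else hidden  # deliberately wrong sometimes
    confidence = 0.9 if hidden % 5 else 0.2
    return approx, confidence

def greedy(prompt, n):
    """Reference decoding: run the full model for every token."""
    out = list(prompt)
    for _ in range(n):
        out.append(lm_head(deep(shallow(out))))
    return out

def self_speculative(prompt, n, max_draft=4, threshold=0.5):
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # Drafting: shallow layers + adapter + LM head only.
        ctx, drafts, hiddens = list(out), [], []
        for _ in range(max_draft):
            h = shallow(ctx)
            approx, conf = adapter(h)
            tok = lm_head(approx)
            drafts.append(tok)
            hiddens.append(h)  # early-exited state, reused during verification
            ctx.append(tok)
            if conf < threshold:  # assumed second early exit: stop drafting
                break
        # Verification: only the remaining layers run on the cached states.
        for tok, h in zip(drafts, hiddens):
            target_tok = lm_head(deep(h))
            out.append(target_tok)  # accepted draft, or the correction
            if target_tok != tok:   # first mismatch: discard later drafts
                break
    return out[:goal]
```

Because every emitted token comes from the full model's prediction on an already verified prefix, the output matches plain greedy decoding exactly; the adapter's quality only affects how many drafted tokens are accepted per round.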


d71a4a6c796cacd9b8a298589943cdf3-Supplemental-Conference.pdf

Neural Information Processing Systems

The code related to the dataset, model, loss, training pipeline, and experiments is enclosed.

Cross-Domain MAFLAFLWMAFLWR 300W Supervised learning TCDCN [13] XX 7.95 7.65 - 5.54 MTCNN [12] XX 5.39 6.90 - WingLoss [3] XX - - - 4.04 Generative modeling based DeformingAE [9] OX 5.45 - - ImGen. [4]

After the initialization period, the intra pseudo-paired data xd1)d1, xd2)d2 and the inter pseudo-paired data xd1)d2, xd2)d1 are generated with the latent space exploration described in Section 3.2. Finally, the semantic matching loss LM is used to obtain the intra semantic matching loss LM1 and the inter semantic matching loss LM2. We provide more examples of pseudo-paired data on various combinations of original and pair domains in Fig. 3.





VTC-LFC: Vision Transformer Compression with Low-Frequency Components

Neural Information Processing Systems

However, compression in the spatial domain alone suffers from a dramatic performance drop without fine-tuning and is not robust to noise, because noise in the spatial domain can easily confuse the pruning criteria, leading to some parameters or channels being pruned incorrectly.


Bernoulli f n Z

Neural Information Processing Systems

At time nodeof 2 have example, We simulate equally UASE, techniques omnib d = 7, while visualisation, above, 1. Cross-sectional: The 2. Longitudinal: The In this section stability described embedding P(1),. Independent UASE, on P tdt dT, but U the linear vT, while d = ran P) is often d.


Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection

arXiv.org Artificial Intelligence

Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves performance gains, with AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5% (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: https://github.com/yogesh-iitj/fs-video-vit
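The filtering idea in this abstract, carrying only high-confidence object features from one frame to the next, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the function name `propagate`, the threshold `tau`, and the representation of a detection as a `(feature, confidence)` pair are all hypothetical, and features are carried forward only from the immediately preceding frame.

```python
def propagate(frames, tau=0.6):
    """frames: one list of (feature, confidence) pairs per video frame.

    Returns, for each frame, its own features followed by the
    high-confidence features carried over from the previous frame.
    """
    memory = []      # filtered features from the previous frame
    per_frame = []
    for detections in frames:
        # The current frame sees its own features plus the filtered memory.
        per_frame.append([f for f, _ in detections] + list(memory))
        # Only confident features survive into the next frame,
        # so low-confidence noise does not accumulate over time.
        memory = [f for f, c in detections if c >= tau]
    return per_frame
```

For example, with `frames = [[("a", 0.9), ("b", 0.3)], [("c", 0.7)], []]`, the low-confidence feature `"b"` is used in its own frame but is not passed on to later frames.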