Goto

Collaborating Authors

 synchronization


SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Wang, Kaidi, He, Yi, Guan, Wenhao, Wu, Weijie, Ding, Hongwu, Zhang, Xiong, Wu, Di, Meng, Meng, Luan, Jian, Li, Lin, Hong, Qingyang

arXiv.org Artificial Intelligence

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.


FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Nrusimha, Aniruddha, Brandon, William, Mishra, Mayank, Shen, Yikang, Panda, Rameswar, Ragan-Kelley, Jonathan, Kim, Yoon

arXiv.org Artificial Intelligence

The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest such as in edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel for accelerating low-batch inference of large language models. Across various model sizes and quantizations settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.


Robust and Modular Multi-Limb Synchronization in Motion Stack for Space Robots with Trajectory Clamping via Hypersphere

Neppel, Elian, Mishra, Ashutosh, Karimov, Shamistan, Uno, Kentaro, Santra, Shreya, Yoshida, Kazuya

arXiv.org Artificial Intelligence

Modular robotics holds immense potential for space exploration, where reliability, repairability, and reusability are critical for cost-effective missions. Coordination between heterogeneous units is paramount for precision tasks -- whether in manipulation, legged locomotion, or multi-robot interaction. Such modular systems introduce challenges far exceeding those in monolithic robot architectures. This study presents a robust method for synchronizing the trajectories of multiple heterogeneous actuators, adapting dynamically to system variations with minimal system knowledge. This design makes it inherently robot-agnostic, thus highly suited for modularity. To ensure smooth trajectory adherence, the multidimensional state is constrained within a hypersphere representing the allowable deviation. The distance metric can be adapted hence, depending on the task and system under control, deformation of the constraint region is possible. This approach is compatible with a wide range of robotic platforms and serves as a core interface for Motion-Stack, our new open-source universal framework for limb coordination (available at https://github.com/2lian/Motion-Stack ). The method is validated by synchronizing the end-effectors of six highly heterogeneous robotic limbs, evaluating both trajectory adherence and recovery from significant external disturbances.


Over-the-Air Federated Learning: Rethinking Edge AI Through Signal Processing

Azimi-Abarghouyi, Seyed Mohammad, Fischione, Carlo, Huang, Kaibin

arXiv.org Artificial Intelligence

Over-the-Air Federated Learning (AirFL) is an emerging paradigm that tightly integrates wireless signal processing and distributed machine learning to enable scalable AI at the network edge. By leveraging the superposition property of wireless signals, AirFL performs communication and model aggregation of the learning process simultaneously, significantly reducing latency, bandwidth, and energy consumption. This article offers a tutorial treatment of AirFL, presenting a novel classification into three design approaches: CSIT -aware, blind, and weighted AirFL. We provide a comprehensive guide to theoretical foundations, performance analysis, complexity considerations, practical limitations, and prospective research directions.


In-Context Sync-LoRA for Portrait Video Editing

Polaczek, Sagi, Patashnik, Or, Mahdavi-Amiri, Ali, Cohen-Or, Daniel

arXiv.org Artificial Intelligence

Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. W e present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. T o enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.


Tensor-Parallelism with Partially Synchronized Activations

Lamprecht, Itay, Karnieli, Asaf, Hanani, Yair, Giladi, Niv, Soudry, Daniel

arXiv.org Artificial Intelligence

Training and inference of Large Language Models (LLMs) with tensor-parallelism requires substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train a 7B parameter CAAT-Net model and show that tensor-parallel communication can be reduced by up to 50% with no significant drop in pretraining accuracy across nearly all evaluated benchmarks. We also experiment with smaller 130M and 1.1B models to show the robustness and scalability of our method. We find that, in some scenarios, validation loss can even improve when reducing communication. Finally, we demonstrate how CAAT-Net accelerates both training and inference workloads across various settings and model sizes.


Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion

Liu, Shaowei, Yao, David Yifan, Gupta, Saurabh, Wang, Shenlong

arXiv.org Artificial Intelligence

Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an median synchronization error below 50 ms.


Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Koh, Jungyeon, Yang, Hyun Jong

arXiv.org Artificial Intelligence

The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.


IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

Chen, Bo, Liu, Tao, Chen, Qi, Chen, Xie, Zheng, Zilong

arXiv.org Artificial Intelligence

Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.


Motion-to-Motion Latency Measurement Framework for Connected and Autonomous Vehicle Teleoperation

Provost, François, Hawlader, Faisal, Testouri, Mehdi, Frank, Raphaël

arXiv.org Artificial Intelligence

Latency is a key performance factor for the teleoperation of Connected and Autonomous Vehicles (CAVs). It affects how quickly an operator can perceive changes in the driving environment and apply corrective actions. Most existing work focuses on Glass-to-Glass (G2G) latency, which captures delays only in the video pipeline. However, there is no standard method for measuring Motion-to-Motion (M2M) latency, defined as the delay between the physical steering movement of the remote operator and the corresponding steering motion in the vehicle. This paper presents an M2M latency measurement framework that uses Hall-effect sensors and two synchronized Raspberry Pi~5 devices. The system records interrupt-based timestamps on both sides to estimate M2M latency, independently of the underlying teleoperation architecture. Precision tests show an accuracy of 10--15~ms, while field results indicate that actuator delays dominate M2M latency, with median values above 750~ms.