Chen, Xing
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Huang, Haoyang, Ma, Guoqing, Duan, Nan, Chen, Xing, Wan, Changyi, Ming, Ranchen, Wang, Tianyu, Wang, Bo, Lu, Zhiying, Li, Aojie, Zeng, Xianfang, Zhang, Xinhao, Yu, Gang, Yin, Yuhe, Wu, Qiling, Sun, Wen, An, Kang, Han, Xin, Sun, Deshan, Ji, Wei, Huang, Bizhu, Li, Brian, Wu, Chenfei, Huang, Guanzhe, Xiong, Huixin, He, Jiaxin, Wu, Jianchang, Yuan, Jianlong, Wu, Jie, Liu, Jiashuai, Guo, Junjing, Tan, Kaijun, Chen, Liangyu, Chen, Qiaohui, Sun, Ran, Yuan, Shanshan, Yin, Shengming, Liu, Sitong, Chen, Wei, Dai, Yaqi, Luo, Yuchu, Ge, Zheng, Guan, Zhisheng, Song, Xiaoniu, Zhou, Yu, Jiao, Binxing, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Xiu, Yi, Zhu, Yibo, Shum, Heung-Yeung, Jiang, Daxin
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames from both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines on this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Ma, Guoqing, Huang, Haoyang, Yan, Kun, Chen, Liangyu, Duan, Nan, Yin, Shengming, Wan, Changyi, Ming, Ranchen, Song, Xiaoniu, Chen, Xing, Zhou, Yu, Sun, Deshan, Zhou, Deyu, Zhou, Jian, Tan, Kaijun, An, Kang, Chen, Mei, Ji, Wei, Wu, Qiling, Sun, Wen, Han, Xin, Wei, Yanan, Ge, Zheng, Li, Aojie, Wang, Bin, Huang, Bizhu, Wang, Bo, Li, Brian, Miao, Changxing, Xu, Chen, Wu, Chenfei, Yu, Chenguang, Shi, Dapeng, Hu, Dingyuan, Liu, Enle, Yu, Gang, Yang, Ge, Huang, Guanzhe, Yan, Gulin, Feng, Haiyang, Nie, Hao, Jia, Haonan, Hu, Hanpeng, Chen, Hanqi, Yan, Haolong, Wang, Heng, Guo, Hongcheng, Xiong, Huilin, Xiong, Huixin, Gong, Jiahao, Wu, Jianchang, Wu, Jiaoren, Wu, Jie, Yang, Jie, Liu, Jiashuai, Li, Jiashuo, Zhang, Jingyang, Guo, Junjing, Lin, Junzhe, Li, Kaixiang, Liu, Lei, Xia, Lei, Zhao, Liang, Tan, Liguo, Huang, Liwen, Shi, Liying, Li, Ming, Li, Mingliang, Cheng, Muhua, Wang, Na, Chen, Qiaohui, He, Qinglin, Liang, Qiuyan, Sun, Quan, Sun, Ran, Wang, Rui, Pang, Shaoliang, Yang, Shiliang, Liu, Sitong, Liu, Siqi, Gao, Shuli, Cao, Tiancheng, Wang, Tianyu, Ming, Weipeng, He, Wenqing, Zhao, Xu, Zhang, Xuelin, Zeng, Xianfang, Liu, Xiaojia, Yang, Xuan, Dai, Yaqi, Yu, Yanbo, Li, Yang, Deng, Yineng, Wang, Yingming, Wang, Yilei, Lu, Yuanwei, Chen, Yu, Luo, Yu, Luo, Yuchu, Yin, Yuhe, Feng, Yuheng, Yang, Yuxiang, Tang, Zecheng, Zhang, Zekai, Yang, Zidong, Jiao, Binxing, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo, Shum, Heung-Yeung, Jiang, Daxin
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos of up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
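To make the training recipe concrete, the following is a minimal sketch of one Flow Matching denoising step on VAE-compressed video latents; every name (model, the latent shape, text_emb) is an illustrative placeholder and not the released Step-Video-T2V code.

```python
# Minimal sketch of a Flow Matching training step on video latents.
# All names are illustrative placeholders, not the released Step-Video-T2V code.
import torch
import torch.nn.functional as F

def flow_matching_step(model, latents, text_emb):
    """One flow-matching update on VAE-compressed latents.

    latents:  (B, C, T, H, W) tensor from a deep-compression video VAE
              (e.g. 16x16 spatial and 8x temporal downsampling).
    text_emb: conditioning from the bilingual text encoders.
    model:    a DiT-style network with 3D full attention (assumed given).
    """
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)  # time in [0, 1]
    t_exp = t.view(-1, 1, 1, 1, 1)
    x_t = (1.0 - t_exp) * noise + t_exp * latents             # linear interpolation path
    target_velocity = latents - noise                          # flow-matching regression target
    pred_velocity = model(x_t, t, text_emb)                    # predicted velocity field
    return F.mse_loss(pred_velocity, target_velocity)
```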
SpreadFGL: Edge-Client Collaborative Federated Graph Learning with Adaptive Neighbor Generation
Zhong, Luying, Pi, Yueyang, Chen, Zheyi, Yu, Zhengxin, Miao, Wang, Chen, Xing, Min, Geyong
Federated Graph Learning (FGL) has garnered widespread attention by enabling collaborative training across multiple clients for semi-supervised classification tasks. However, most existing FGL studies do not adequately consider the missing inter-client topology information that arises in real-world scenarios, causing insufficient feature aggregation from multi-hop neighbor clients during model training. Moreover, classic FGL commonly adopts FedAvg but neglects the high training costs incurred as the number of clients grows, resulting in the overload of a single edge server. To address these important challenges, we propose a novel FGL framework, named SpreadFGL, to promote the information flow in edge-client collaboration and extract more generalized potential relationships between clients. In SpreadFGL, an adaptive graph imputation generator, incorporated with a versatile assessor, is first designed to exploit the potential links between subgraphs without sharing raw data. Next, a new negative sampling mechanism is developed to make SpreadFGL concentrate on more refined information in downstream tasks. To facilitate load balancing at the edge layer, SpreadFGL follows a distributed training scheme that enables fast model convergence. Extensive experiments on a real-world testbed and benchmark graph datasets demonstrate the effectiveness of the proposed SpreadFGL. The results show that SpreadFGL achieves higher accuracy and faster convergence than state-of-the-art algorithms.
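As a rough illustration of the cross-client link imputation idea, the sketch below scores candidate edges between local node embeddings and embeddings shared by a neighboring client; the dot-product scorer, the threshold, and all names are assumptions for illustration, not SpreadFGL's actual generator or assessor.

```python
# Toy sketch of cross-client link imputation from shared node embeddings
# (raw features are never exchanged). Scorer and threshold are illustrative.
import torch

def impute_cross_client_links(emb_local, emb_remote, threshold=0.5):
    """emb_local: (N_local, d) embeddings of local nodes;
    emb_remote: (N_remote, d) embeddings shared by a neighboring client.
    Returns index pairs of candidate inter-client edges."""
    scores = torch.sigmoid(emb_local @ emb_remote.t())   # (N_local, N_remote) link scores
    return (scores > threshold).nonzero(as_tuple=False)  # candidate (local, remote) pairs
```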
Careful at Estimation and Bold at Exploration
Chen, Xing, Liu, Yijun, Liu, Zhaogeng, Chen, Hechang, Yao, Hengshuai, Chang, Yi
Exploration strategies in continuous action spaces are often heuristic because the number of actions is infinite, and such methods cannot derive a general conclusion. Prior work has shown that policy-based exploration is beneficial for continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful due to inaccurate estimation. Based on the double-Q function framework, we introduce a novel exploration strategy to mitigate these issues, separate from the policy gradient. We first propose a greedy Q softmax update scheme for the Q value update. The expected Q value is derived by a weighted sum of the conservative Q values over actions, where the weights are the corresponding greedy Q values. Here, greedy Q takes the maximum of the two Q functions, and conservative Q takes their minimum. For practicality, this theoretical basis is then extended so that action exploration can be combined with the Q value update, provided that we have a surrogate policy that behaves like the exploration policy. In practice, we construct such an exploration policy from a few sampled actions and, to meet this premise, learn the surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed from the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
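A minimal sketch of the greedy-Q-weighted softmax backup described above, computed over a small set of sampled actions, is given below; the temperature and all names are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of the greedy Q softmax backup over K sampled actions (illustrative).
import torch

def greedy_softmax_target(q1_values, q2_values, temperature=1.0):
    """q1_values, q2_values: (B, K) estimates of the two critics at K sampled actions.
    Returns a (B,) expected Q: the conservative Q (min of the two critics)
    weighted by a softmax over the greedy Q (max of the two critics)."""
    greedy_q = torch.max(q1_values, q2_values)        # greedy Q: element-wise max
    conservative_q = torch.min(q1_values, q2_values)  # conservative Q: element-wise min
    weights = torch.softmax(greedy_q / temperature, dim=-1)
    return (weights * conservative_q).sum(dim=-1)
```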
LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification
Chen, Xing, Wang, Jie, Zhang, Xiao-Lei, Zhang, Wei-Qiang, Yang, Kunde
Although the security of automatic speaker verification (ASV) is seriously threatened by recently emerged adversarial attacks, there have been some countermeasures to alleviate the threat. However, many defense approaches not only require prior knowledge of the attackers but also possess weak interpretability. To address this issue, in this paper, we propose an attacker-independent and interpretable method, named learnable mask detector (LMD), to separate adversarial examples from genuine ones. It utilizes score variation as an indicator to detect adversarial examples, where the score variation is the absolute discrepancy between the ASV scores of an original audio recording and of the transformed audio synthesized from its masked complex spectrogram. A core component of the score variation detector is a neural network that generates the masked spectrogram. The neural network needs only genuine examples for training, which makes it an attacker-independent approach. Its interpretability lies in the fact that the neural network is trained to minimize the score variation of the targeted ASV while maximizing the number of masked spectrogram bins of the genuine training examples. It is founded on the observation that masking out the vast majority of spectrogram bins, which carry little speaker information, inevitably introduces a large score variation for an adversarial example and only a small score variation for a genuine example. Experimental results with 12 attackers and two representative ASV systems show that our proposed method outperforms five state-of-the-art baselines. The extensive experimental results can also serve as a benchmark for detection-based ASV defenses.
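The score-variation indicator can be sketched as follows; asv_score, mask_net, stft and istft are placeholder callables standing in for the ASV system, the learnable mask network and the spectrogram transforms, so this is an assumption-laden illustration rather than the paper's implementation.

```python
# Illustrative sketch of the score-variation indicator; all callables are placeholders.
import numpy as np

def score_variation(asv_score, mask_net, stft, istft, waveform, enrolled_speaker):
    """Large score variation suggests an adversarial example, small suggests genuine."""
    spec = stft(waveform)                        # complex spectrogram
    masked_spec = mask_net(np.abs(spec)) * spec  # mask out bins carrying little speaker info
    transformed = istft(masked_spec)             # re-synthesized audio
    s_orig = asv_score(waveform, enrolled_speaker)
    s_masked = asv_score(transformed, enrolled_speaker)
    return abs(s_orig - s_masked)                # absolute discrepancy between the two scores
```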
Model Extraction Attacks on Split Federated Learning
Li, Jingtao, Rakin, Adnan Siraj, Chen, Xing, Yang, Li, He, Zhezhi, Fan, Deliang, Chakrabarti, Chaitali
Federated Learning (FL) is a popular collaborative learning scheme involving multiple clients and a server. FL focuses on protecting clients' data but turns out to be highly vulnerable to Intellectual Property (IP) threats. Since FL periodically collects and distributes the model parameters, a free-rider can download the latest model and thus steal model IP. Split Federated Learning (SFL), a recent variant of FL that supports training with resource-constrained clients, splits the model into two parts, giving one part to the clients (client-side model) and the remaining part to the server (server-side model). Thus SFL prevents model leakage by design. Moreover, by blocking prediction queries, it can be made resistant to advanced IP threats such as traditional Model Extraction (ME) attacks. While SFL is better than FL in terms of providing IP protection, it is still vulnerable. In this paper, we expose the vulnerability of SFL and show how malicious clients can launch ME attacks by querying gradient information from the server side. We propose five variants of the ME attack that differ in gradient usage as well as in data assumptions. We show that, in practical cases, the proposed ME attacks work exceptionally well against SFL. For instance, when the server-side model has five layers, our proposed ME attack can achieve over 90% accuracy with less than 2% accuracy degradation, using VGG-11 on CIFAR-10.
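One plausible flavor of such an attack is gradient matching: the malicious client queries the server with its activations, receives the gradients, and fits a local surrogate of the server-side model to reproduce them. The sketch below assumes a server_query interface and PyTorch models; it illustrates the general idea only and is not one of the paper's five specific variants.

```python
# Hypothetical gradient-matching extraction step against a split-learning server.
# server_query, client_model and surrogate_server are assumed interfaces.
import torch
import torch.nn.functional as F

def extraction_step(client_model, surrogate_server, server_query, x, y, opt):
    acts = client_model(x)                       # client-side forward pass
    grad_real = server_query(acts.detach(), y)   # gradients returned by the real server

    acts_s = acts.detach().requires_grad_(True)
    loss_s = F.cross_entropy(surrogate_server(acts_s), y)
    grad_fake, = torch.autograd.grad(loss_s, acts_s, create_graph=True)

    opt.zero_grad()
    F.mse_loss(grad_fake, grad_real).backward()  # make the surrogate mimic the server
    opt.step()
```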
The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure
Chen, Xing, Diao, Dongcui, Chen, Hechang, Yao, Hengshuai, Piao, Haiyin, Sun, Zhixiao, Yang, Zhiwei, Goebel, Randy, Jiang, Bei, Chang, Yi
The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Do better policies exist outside of this space? By using a novel surrogate objective that employs the sigmoid function (which provides an interesting way of exploration), we found that the answer is ``YES'', and the better policies are in fact located very far from the clipped space. We show that PPO is insufficient in ``off-policyness'', according to an off-policy metric called DEON. Our algorithm explores in a much larger policy space than PPO, and it maximizes the Conservative Policy Iteration (CPI) objective better than PPO during training. To the best of our knowledge, all current PPO methods employ the clipping operation and optimize in the clipped policy space. Our method is the first of its kind, and it advances the understanding of CPI optimization and policy gradient methods. Code is available at https://github.com/raincchio/P3O.
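For intuition, the clipped PPO objective and a sigmoid-shaped surrogate in the spirit of the abstract can be contrasted as below; the exact functional form and scale used in P3O may differ, so treat the second function as an illustrative assumption and consult the linked repository for the real objective.

```python
# PPO's clipped objective vs. an illustrative sigmoid-shaped surrogate.
import torch

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Hard clipping confines the probability ratio to [1 - eps, 1 + eps].
    return torch.min(ratio * advantage,
                     torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()

def sigmoid_surrogate_objective(ratio, advantage, scale=4.0):
    # Saturates smoothly instead of clipping, so optimization can move the policy
    # much farther from the behavior policy. Form and scale are illustrative only.
    return (torch.sigmoid(scale * (ratio - 1.0)) * advantage).mean()
```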
Symmetric Saliency-based Adversarial Attack To Speaker Identification
Yao, Jiadi, Chen, Xing, Zhang, Xiao-Lei, Zhang, Wei-Qiang, Yang, Kunde
To our knowledge, existing adversarial attack approaches to speaker identification either incur high computational cost or are not very effective. To address this issue, in this paper, we propose a novel generation-network-based approach, called symmetric saliency-based encoder-decoder (SSED), to generate adversarial voice examples against speaker identification. It contains two novel components. First, it uses a novel saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system, so as to make the attacker focus on generating artificial noise for the important samples. Second, it proposes an angular loss function to push the speaker embedding far away from that of the source speaker. Our experimental results demonstrate that the proposed SSED yields state-of-the-art performance, i.e., an over 97% targeted attack success rate and a signal-to-noise level of over 39 dB on both the open-set and closed-set speaker identification tasks, at a low computational cost.
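The second component can be pictured as pushing the adversarial utterance's speaker embedding away from the source speaker's embedding; the cosine-based loss below is an illustrative stand-in, not necessarily the exact angular loss used in SSED.

```python
# Illustrative "push away from the source speaker" loss; not SSED's exact formulation.
import torch.nn.functional as F

def angular_push_loss(adv_embedding, src_embedding):
    # Minimizing the cosine similarity enlarges the angle between the adversarial
    # and source-speaker embeddings.
    return F.cosine_similarity(adv_embedding, src_embedding, dim=-1).mean()
```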
Revenue and Energy Efficiency-Driven Delay Constrained Computing Task Offloading and Resource Allocation in a Vehicular Edge Computing Network: A Deep Reinforcement Learning Approach
Huang, Xinyu, He, Lijun, Chen, Xing, Wang, Liejun, Li, Fan
For in-vehicle applications, the task type and vehicle state information, i.e., vehicle speed, have a significant impact on the task delay requirement. However, the joint impact of task type and vehicle speed on the task delay constraint has not been studied, and this gap may cause a mismatch between the task delay requirement and the allocated computation and wireless resources. In this paper, we propose a joint task type and vehicle speed-aware task offloading and resource allocation strategy to decrease the vehicle's energy cost for executing tasks and increase the revenue of the vehicle for processing tasks within the delay constraint. First, we establish the joint task type and vehicle speed-aware delay constraint model. Then, the delay, energy cost and revenue for task execution in the vehicular edge computing (VEC) server, the local terminal and the terminals of other vehicles are calculated. Based on the energy cost and revenue from task execution, the utility function of the vehicle is acquired. Next, we formulate a joint optimization of task offloading and resource allocation to maximize the utility level of the vehicles subject to the constraints of task delay, computation resources and wireless resources. To obtain a near-optimal solution of the formulated problem, a joint offloading and resource allocation algorithm based on multi-agent deep deterministic policy gradient (JORA-MADDPG) is proposed to maximize the utility level of the vehicles. Simulation results show that our algorithm achieves superior performance in task completion delay, vehicles' energy cost and processing revenue.
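A toy rendering of the utility implied by this formulation is given below; the speed-dependent delay budget and its linear form are illustrative assumptions, not the paper's model.

```python
# Toy sketch of a speed-aware delay budget and per-task utility (illustrative only).

def delay_budget(base_deadline, vehicle_speed, speed_factor=0.01):
    # Faster vehicles leave edge coverage sooner, so the effective budget shrinks
    # with speed; the linear form is purely illustrative.
    return base_deadline / (1.0 + speed_factor * vehicle_speed)

def vehicle_utility(revenue, energy_cost, delay, budget):
    # Revenue for processing the task minus the energy cost of executing it,
    # earned only if the task meets its delay constraint.
    return revenue - energy_cost if delay <= budget else 0.0
```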
Recommending Related Microblogs: A Comparison Between Topic and WordNet based Approaches
Chen, Xing (Wuhan University of Technology), Li, Lin (Wuhan University of Technology), Xu, Guandong (Victoria University), Yang, Zhenglu (The University of Tokyo), Kitsuregawa, Masaru (The University of Tokyo)
Computing similarity between short microblogs is an important step in microblog recommendation. In this paper, we investigate a topic-based approach and a WordNet-based approach to estimate similarity scores between microblogs and recommend the top related ones to users. An empirical study is conducted to compare their recommendation effectiveness using two evaluation measures. The results show that the WordNet-based approach has relatively higher precision than the topic-based approach on a dataset of 548 tweets. In addition, the Kendall tau distance between the two lists recommended by the WordNet and topic approaches is calculated. Its average over all 548 list pairs indicates that the two approaches disagree considerably in their rankings of related tweets.
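The Kendall tau distance used above counts discordant pairs between two rankings of the same tweets; a minimal implementation, assuming both recommendation lists contain the same items, is:

```python
# Kendall tau distance: number of discordant pairs between two rankings
# of the same items (here, the WordNet-based and topic-based tweet lists).
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    discordant = 0
    for x, y in combinations(rank_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            discordant += 1
    return discordant
```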