Goto

Collaborating Authors

 Wang, Yanfeng


FedSkip: Combatting Statistical Heterogeneity with Federated Skip Aggregation

arXiv.org Artificial Intelligence

The statistical heterogeneity of the non-independent and identically distributed (non-IID) data in local clients significantly limits the performance of federated learning. Previous attempts like FedProx, SCAFFOLD, MOON, FedNova and FedDyn resort to an optimization perspective, which requires an auxiliary term or re-weights local updates to calibrate the learning bias or the objective inconsistency. However, in addition to previous explorations for improvement in federated averaging, our analysis shows that another critical bottleneck is the poorer optima of client models in more heterogeneous conditions. We thus introduce a data-driven approach called FedSkip to improve the client optima by periodically skipping federated averaging and scattering local models to the cross devices. We provide theoretical analysis of the possible benefit from FedSkip and conduct extensive experiments on a range of datasets to demonstrate that FedSkip achieves much higher accuracy, better aggregation efficiency and competing communication efficiency. Source code is available at: https://github.com/MediaBrain-SJTU/FedSkip.


Collaborative Uncertainty Benefits Multi-Agent Multi-Modal Trajectory Forecasting

arXiv.org Machine Learning

In multi-modal multi-agent trajectory forecasting, two major challenges have not been fully tackled: 1) how to measure the uncertainty brought by the interaction module that causes correlations among the predicted trajectories of multiple agents; 2) how to rank the multiple predictions and select the optimal predicted trajectory. In order to handle these challenges, this work first proposes a novel concept, collaborative uncertainty (CU), which models the uncertainty resulting from interaction modules. Then we build a general CU-aware regression framework with an original permutation-equivariant uncertainty estimator to do both tasks of regression and uncertainty estimation. Further, we apply the proposed framework to current SOTA multi-agent multi-modal forecasting systems as a plugin module, which enables the SOTA systems to 1) estimate the uncertainty in the multi-agent multi-modal trajectory forecasting task; 2) rank the multiple predictions and select the optimal one based on the estimated uncertainty. We conduct extensive experiments on a synthetic dataset and two public large-scale multi-agent trajectory forecasting benchmarks. Experiments show that: 1) on the synthetic dataset, the CU-aware regression framework allows the model to appropriately approximate the ground-truth Laplace distribution; 2) on the multi-agent trajectory forecasting benchmarks, the CU-aware regression framework steadily helps SOTA systems improve their performances. Specially, the proposed framework helps VectorNet improve by 262 cm regarding the Final Displacement Error of the chosen optimal prediction on the nuScenes dataset; 3) for multi-agent multi-modal trajectory forecasting systems, prediction uncertainty is positively correlated with future stochasticity; and 4) the estimated CU values are highly related to the interactive information among agents.


Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning

arXiv.org Artificial Intelligence

Handwritten mathematical expression recognition aims to automatically generate LaTeX sequences from given images. Currently, attention-based encoder-decoder models are widely used in this task. They typically generate target sequences in a left-to-right (L2R) manner, leaving the right-to-left (R2L) contexts unexploited. In this paper, we propose an Attention aggregation based Bi-directional Mutual learning Network (ABM) which consists of one shared encoder and two parallel inverse decoders (L2R and R2L). The two decoders are enhanced via mutual distillation, which involves one-to-one knowledge transfer at each training step, making full use of the complementary information from two inverse directions. Moreover, in order to deal with mathematical symbols in diverse scales, an Attention Aggregation Module (AAM) is proposed to effectively integrate multi-scale coverage attentions. Notably, in the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference, keeping the original parameter size and inference speed. Extensive experiments demonstrate that our proposed approach achieves the recognition accuracy of 56.85 % on CROHME 2014, 52.92 % on CROHME 2016, and 53.96 % on CROHME 2019 without data augmentation and model ensembling, substantially outperforming the state-of-the-art methods. The source code is available in the supplementary materials.


Multiscale Spatio-Temporal Graph Neural Networks for 3D Skeleton-Based Motion Prediction

arXiv.org Artificial Intelligence

We propose a multiscale spatio-temporal graph neural network (MST-GNN) to predict the future 3D skeleton-based human poses in an action-category-agnostic manner. The core of MST-GNN is a multiscale spatio-temporal graph that explicitly models the relations in motions at various spatial and temporal scales. Different from many previous hierarchical structures, our multiscale spatio-temporal graph is built in a data-adaptive fashion, which captures nonphysical, yet motion-based relations. The key module of MST-GNN is a multiscale spatio-temporal graph computational unit (MST-GCU) based on the trainable graph structure. MST-GCU embeds underlying features at individual scales and then fuses features across scales to obtain a comprehensive representation. The overall architecture of MST-GNN follows an encoder-decoder framework, where the encoder consists of a sequence of MST-GCUs to learn the spatial and temporal features of motions, and the decoder uses a graph-based attention gate recurrent unit (GA-GRU) to generate future poses. Extensive experiments are conducted to show that the proposed MST-GNN outperforms state-of-the-art methods in both short and long-term motion prediction on the datasets of Human 3.6M, CMU Mocap and 3DPW, where MST-GNN outperforms previous works by 5.33% and 3.67% of mean angle errors in average for short-term and long-term prediction on Human 3.6M, and by 11.84% and 4.71% of mean angle errors for short-term and long-term prediction on CMU Mocap, and by 1.13% of mean angle errors on 3DPW in average, respectively. We further investigate the learned multiscale graphs for interpretability.


Decoupled Variational Embedding for Signed Directed Networks

arXiv.org Machine Learning

Node representation learning for signed directed networks has received considerable attention in many real-world applications such as link sign prediction, node classification and node recommendation. The challenge lies in how to adequately encode the complex topological information of the networks. Recent studies mainly focus on preserving the first-order network topology which indicates the closeness relationships of nodes. However, these methods generally fail to capture the high-order topology which indicates the local structures of nodes and serves as an essential characteristic of the network topology. In addition, for the first-order topology, the additional value of non-existent links is largely ignored. In this paper, we propose to learn more representative node embeddings by simultaneously capturing the first-order and high-order topology in signed directed networks. In particular, we reformulate the representation learning problem on signed directed networks from a variational auto-encoding perspective and further develop a decoupled variational embedding (DVE) method. DVE leverages a specially designed auto-encoder structure to capture both the first-order and high-order topology of signed directed networks, and thus learns more representative node embedding. Extensive experiments are conducted on three widely used real-world datasets. Comprehensive results on both link sign prediction and node recommendation task demonstrate the effectiveness of DVE. Qualitative results and analysis are also given to provide a better understanding of DVE.


Data Augmentation Revisited: Rethinking the Distribution Gap between Clean and Augmented Data

arXiv.org Machine Learning

Data augmentation has been widely applied as an effective methodology to prevent over-fitting in particular when training very deep neural networks. The essential benefit comes from introducing additional priors in visual invariance, and thus generate images in different appearances but containing the same semantics. Recently, researchers proposed a few powerful data augmentation techniques which indeed improved accuracy, yet we notice that these methods have also caused a considerable gap between clean and augmented data. This paper revisits this problem from an analytical perspective, for which we estimate the upper-bound of testing loss using two terms, named empirical risk and generalization error, respectively. Data augmentation significantly reduces the generalization error, but meanwhile leads to a larger empirical risk, which can be alleviated by a simple algorithm, i.e., using less-augmented data to refine the model trained on fully-augmented data . We validate our approach on a few popular image classification datasets including CIFAR and ImageNet, and demonstrate consistent accuracy gain. We also conjecture that this simple strategy implies a generalized approach to circumvent local minima, which is of value to future research on model optimization.


Defending Adversarial Attacks by Correcting logits

arXiv.org Machine Learning

Generating and eliminating adversarial examples has been an intriguing topic in the field of deep learning. While previous research verified that adversarial attacks are often fragile and can be defended via image-level processing, it remains unclear how high-level features are perturbed by such attacks. We investigate this issue from a new perspective, which purely relies on logits, the class scores before softmax, to detect and defend adversarial attacks. Our defender is a two-layer network trained on a mixed set of clean and perturbed logits, with the goal being recovering the original prediction. Upon a wide range of adversarial attacks, our simple approach shows promising results with relatively high accuracy in defense, and the defender can transfer across attackers with similar properties. More importantly, our defender can work in the scenarios that image data are unavailable, and enjoys high interpretability especially at the semantic level.


Actional-Structural Graph Convolutional Networks for Skeleton-based Action Recognition

arXiv.org Artificial Intelligence

Action recognition with skeleton data has recently attracted much attention in computer vision. Previous studies are mostly based on fixed skeleton graphs, only capturing local physical dependencies among joints, which may miss implicit joint correlations. To capture richer dependencies, we introduce an encoder-decoder structure, called A-link inference module, to capture action-specific latent dependencies, i.e. actional links, directly from actions. We also extend the existing skeleton graphs to represent higher-order dependencies, i.e. structural links. Combing the two types of links into a generalized skeleton graph, we further propose the actional-structural graph convolution network (AS-GCN), which stacks actional-structural graph convolution and temporal convolution as a basic building block, to learn both spatial and temporal features for action recognition. A future pose prediction head is added in parallel to the recognition head to help capture more detailed action patterns through self-supervision. We validate AS-GCN in action recognition using two skeleton data sets, NTU-RGB+D and Kinetics. The proposed AS-GCN achieves consistently large improvement compared to the state-of-the-art methods. As a side product, AS-GCN also shows promising results for future pose prediction.


Network Compression via Recursive Bayesian Pruning

arXiv.org Machine Learning

Recently, compression and acceleration of deep neural networks are in critic need. Bayesian generalization of structured pruning represents an important research direction to solve the above problem. However, the existing Bayesian methods ignore the dependency among neurons and filters for computational simplicity. In this study, we explore, under Bayesian framework, a structured pruning method with layer-wise sequential dependency assumed, a more general learning setting. Based on the property of Dirac distribution, we further derive a new dropout noise, which makes it possible to approximate the posterior of dropout noise knowing that of the previous layer. With the Dirac-like dropout noise, we further propose a recursive strategy, named \emph{Recursive Bayesian Pruning} (RBP), to train and prune networks in a layer-by-layer fashion. The unimportant neurons and filters are directly targeted and removed, taking the influence from the previous layer. Experiments on typical neural networks LeNet-300-100, LeNet-5 and VGG-16 have demonstrated the proposed method are competitive with or even outperform the state-of-the-art methods in several compression and acceleration metrics.


Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems

AAAI Conferences

Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses according to users' speech utterances, in which the most critical problem is to analyze user intention. Researches show that user intention conveyed through speech is not only expressed by content, but also closely related with users' speaking manners (e.g. with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus by text and emphasis by speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM) which is capable of modeling long short-term contextual dependencies to detect focus and emphasis, and incorporate the tasks for focus and emphasis detection with multi-task learning (MTL) to reinforce the performance of each other. We then employ Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users' dialect conventions) to predict IP based on feature correlations. Experiments on a data set of 135,566 utterances collected from real-world Sogou Voice Assistant illustrate that our method can outperform the comparison methods over 6.9-24.5% in terms of F1-measure. Moreover, a real practice in the Sogou Voice Assistant indicates that our method can improve the performance on user intention understanding by 7%.