Goto

Collaborating Authors

 Chen, Long


Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

arXiv.org Artificial Intelligence

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., expression-agnostic), hoping that the proposals contain all right instances in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, which is the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores can guide the NMS operation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects, resulting in a significantly improved grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS.


Defining Digital Quadruplets in the Cyber-Physical-Social Space for Parallel Driving

arXiv.org Artificial Intelligence

Parallel driving is a novel framework to synthesize vehicle intelligence and transport automation. This article aims to define digital quadruplets in parallel driving. In the cyber-physical-social systems (CPSS), based on the ACP method, the names of the digital quadruplets are first given, which are descriptive, predictive, prescriptive and real vehicles. The objectives of the three virtual digital vehicles are interacting, guiding, simulating and improving with the real vehicles. Then, the three virtual components of the digital quadruplets are introduced in detail and their applications are also illustrated. Finally, the real vehicles in the parallel driving system and the research process of the digital quadruplets are depicted. The presented digital quadruplets in parallel driving are expected to make the future connected automated driving safety, efficiently and synergistically.


Digital Quadruplets for Cyber-Physical-Social Systems based Parallel Driving: From Concept to Applications

arXiv.org Artificial Intelligence

Digital quadruplets aiming to improve road safety, traffic efficiency, and driving cooperation for future connected automated vehicles are proposed with the enlightenment of ACP based parallel driving. The ACP method denotes Artificial societies, Computational experiments, and Parallel execution modules for cyber-physical-social systems. Four agents are designed in the framework of digital quadruplets: descriptive vehicles, predictive vehicles, prescriptive vehicles, and real vehicles. The three virtual vehicles (descriptive, predictive, and prescriptive) dynamically interact with the real one in order to enhance the safety and performance of the real vehicle. The details of the three virtual vehicles in the digital quadruplets are described. Then, the interactions between the virtual and real vehicles are presented. The experimental results of the digital quadruplets demonstrate the effectiveness of the proposed framework.


Comparison of Different Methods for Time Sequence Prediction in Autonomous Vehicles

arXiv.org Artificial Intelligence

As a combination of various kinds of technologies, autonomous vehicles could complete a series of driving tasks by itself, such as perception, decision-making, planning, and control. Since there is no human driver to handle the emergency situation, future transportation information is significant for automated vehicles. This paper proposes different methods to forecast the time series for autonomous vehicles, which are the nearest neighborhood (NN), fuzzy coding (FC), and long short term memory (LSTM). First, the formulation and operational process for these three approaches are introduced. Then, the vehicle velocity is regarded as a case study and the real-world dataset is utilized to predict future information via these techniques. Finally, the performance, merits, and drawbacks of the presented methods are analyzed and discussed.


MR-GNN: Multi-Resolution and Dual Graph Neural Network for Predicting Structured Entity Interactions

arXiv.org Machine Learning

Predicting interactions between structured entities lies at the core of numerous tasks such as drug regimen and new material design. In recent years, graph neural networks have become attractive. They represent structured entities as graphs and then extract features from each individual graph using graph convolution operations. However, these methods have some limitations: i) their networks only extract features from a fix-sized subgraph structure (i.e., a fix-sized receptive field) of each node, and ignore features in substructures of different sizes, and ii) features are extracted by considering each entity independently, which may not effectively reflect the interaction between two entities. To resolve these problems, we present MR-GNN, an end-to-end graph neural network with the following features: i) it uses a multi-resolution based architecture to extract node features from different neighborhoods of each node, and, ii) it uses dual graph-state long short-term memory networks (L-STMs) to summarize local features of each graph and extracts the interaction features between pairwise graphs. Experiments conducted on real-world datasets show that MR-GNN improves the prediction of state-of-the-art methods.


Maximum Principle Based Algorithms for Deep Learning

arXiv.org Machine Learning

The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using the Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains favorable initial convergence rate per-iteration, provided Hamiltonian maximization can be efficiently carried out - a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.


Improving Negative Sampling for Word Representation using Self-embedded Features

arXiv.org Machine Learning

Although the word-popularity based negative sampler has shown superb performance in the skip-gram model, the theoretical motivation behind oversampling popular (non-observed) words as negative samples is still not well understood. In this paper, we start from an investigation of the gradient vanishing issue in the skip-gram model without a proper negative sampler. By performing an insightful analysis from the stochastic gradient descent (SGD) learning perspective, we demonstrate that, both theoretically and intuitively, negative samples with larger inner product scores are more informative than those with lower scores for the SGD learner in terms of both convergence rate and accuracy. Understanding this, we propose an alternative sampling algorithm that dynamically selects informative negative samples during each SGD update. More importantly, the proposed sampler accounts for multi-dimensional self-embedded features during the sampling process, which essentially makes it more effective than the original popularity-based (one-dimensional) sampler. Empirical experiments further verify our observations, and show that our fine-grained samplers gain significant improvement over the existing ones without increasing computational complexity.


Real-Time Fashion-Guided Clothing Semantic Parsing: A Lightweight Multi-Scale Inception Neural Network and Benchmark

AAAI Conferences

Currently two barriers exist that sabotage clothing semantic parsing research: existing methods are time-consuming and the lack of large publicly available dataset that enables parsing at multiple scales. To mitigate these two dilemmas, we hereby embrace deep learning method and design a lightweight multi-scale inception neural network which is at both inside and outside multi-scale inception during training. Moreover, atrous convolution block is involved to enlarge the field of view while bringing neither extra computation cost nor parameters. Then the pre-trained model is further pruned and compressed by fine-tuning on a lightweight version of the same network used earlier, in which the inactive feature response and connections below a pre-defined threshold are directly removed. Besides, we construct so far the largest fashion guided clothing semantic parsing dataset (FCP) which contains a total of 5,000 clothing images and each image associates with both pixel-level, object-level and image-level annotations. All clothing in the dataset are recommended by fashion experts or trendsetters and contains as many as 65 common clothing items, accessories. We organize the dataset as Wordnet tree structure so that it enables fashionably parsing hierarchically. Finally, we conduct extensive experiments on three currently available datasets. Both quantitative and qualitative results demonstrate the priority and feasibility of our method, comparing with several other deep learning based methods. Our method achieves 35 FPS in a single Nvidia Titian X GPU with only minimal accuracy loss.