Gong, Minglun
Systematic Literature Review of Vision-Based Approaches to Outdoor Livestock Monitoring with Lessons from Wildlife Studies
Scott, Stacey D., Abbas, Zayn J., Ellid, Feerass, Dykhne, Eli-Henry, Islam, Muhammad Muhaiminul, Ayad, Weam, Kacmorova, Kristina, Tulpan, Dan, Gong, Minglun
Precision livestock farming (PLF) aims to improve the health and welfare of livestock animals and farming outcomes through the use of advanced technologies. Computer vision, combined with recent advances in machine learning and deep learning artificial intelligence approaches, offers a possible path to the PLF ideal of 24/7 livestock monitoring, which facilitates early detection of animal health and welfare issues. However, a significant number of livestock species are raised in large outdoor habitats that pose technological challenges for computer vision approaches. This review provides a comprehensive overview of computer vision methods and open challenges in outdoor animal monitoring. We include research from both the livestock and wildlife fields because of the similarities in appearance, behaviour, and habitat between many livestock and wildlife species. We focus on large terrestrial mammals, such as cattle, horses, deer, goats, sheep, koalas, giraffes, and elephants. We use an image processing pipeline to frame our discussion, highlighting the current capabilities and open technical challenges at each stage of the pipeline. The review found a clear trend towards deep learning approaches for animal detection, counting, and multi-species classification. We discuss in detail the applicability of current vision-based methods to PLF contexts and promising directions for future research.
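To make the organizing frame concrete, below is a minimal Python sketch of the kind of detection-then-counting pipeline the review uses to structure its discussion. The detector choice (torchvision's pretrained Faster R-CNN), the COCO-style labels, and the score threshold are illustrative assumptions, not methods endorsed by the paper.

```python
# Illustrative sketch of the image-processing pipeline framing the review:
# detect animals in a frame, then count the detections. Real outdoor
# monitoring must also handle occlusion, herd density, and varied lighting.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_animals(image, score_thresh=0.5):
    """Stage 1: detect candidate animals in an outdoor frame."""
    with torch.no_grad():
        out = detector([to_tensor(image)])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep]

def count_animals(boxes):
    """Stage 2: naive count = number of detections."""
    return len(boxes)
```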
Neural Packing: from Visual Sensing to Reinforcement Learning
Xu, Juzhan, Gong, Minglun, Zhang, Hao, Huang, Hui, Hu, Ruizhen
We present a novel learning framework to solve the transport-and-packing (TAP) problem in 3D. It constitutes a full solution pipeline, from partial observation of the input objects via RGBD sensing and recognition, through robotic motion planning, to final box placement that achieves a compact packing in a target container. The technical core of our method is a neural network for TAP, trained via reinforcement learning (RL), to solve this NP-hard combinatorial optimization problem. Our network simultaneously selects an object to pack and determines its final packing location, based on a judicious encoding of the continuously evolving states of the partially observed source objects and the available spaces in the target container, using two separate attention-enabled encoders. The encoded feature vectors are used to compute matching scores and feasibility masks for the different pairings of box selection and available space configuration, which drive the packing strategy optimization. Extensive experiments, including ablation studies and physical packing execution by a real robot (Universal Robot UR5e), are conducted to evaluate our method in terms of its design choices, scalability, generalizability, and comparisons to baselines, including the most recent RL-based TAP solution. We also contribute the first benchmark for TAP, which covers a variety of input settings and difficulty levels.
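The following is a minimal PyTorch sketch, under stated assumptions, of the pairing-score idea described in the abstract: two attention-enabled encoders for objects and free spaces, a matching score per (object, space) pairing, and a feasibility mask applied before the policy samples a packing action. All layer sizes, names, and the dot-product scoring are illustrative; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class PairingPolicy(nn.Module):
    def __init__(self, d_obj=64, d_space=64, d=128, heads=4):
        super().__init__()
        # Separate attention-enabled encoders for objects and free spaces.
        self.obj_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), 2)
        self.space_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), 2)
        self.obj_in = nn.Linear(d_obj, d)
        self.space_in = nn.Linear(d_space, d)

    def forward(self, objs, spaces, feasible):
        # objs: (B, N, d_obj), spaces: (B, M, d_space)
        # feasible: (B, N, M) boolean mask of geometrically valid pairings
        o = self.obj_enc(self.obj_in(objs))        # (B, N, d)
        s = self.space_enc(self.space_in(spaces))  # (B, M, d)
        scores = torch.einsum("bnd,bmd->bnm", o, s)  # matching scores
        scores = scores.masked_fill(~feasible, float("-inf"))
        # Flatten to one categorical action over (object, space) pairs,
        # from which an RL algorithm can sample a packing step.
        return torch.softmax(scores.flatten(1), dim=-1)
```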
3D Pose Estimation and Future Motion Prediction from 2D Images
Yang, Ji, Ma, Youdong, Zuo, Xinxin, Wang, Sen, Gong, Minglun, Cheng, Li
In many recent efforts [1, 2, 3, 4], 3D human pose estimation has been decomposed into a two-stage process: first, the 2D keypoints corresponding to the body joints are detected in the 2D image, then the detected joints are lifted to obtain the 3D pose. This type of solution is elegant in the simplicity of its problem formulation; unfortunately, it suffers from an inherent ambiguity caused by projection: different 3D poses can share the same 2D pose projection under a specific viewpoint, i.e., the mapping between detected 2D joints and 3D poses is not bijective. To resolve this ambiguity of 3D pose estimation from a monocular image, video-based pose estimation has also been investigated in the literature [5, 6]. Existing video-based pose estimation methods, however, either need to observe a relatively long history (243 frames [5]) or can only handle a short video sequence (4-6 frames [6]) to achieve their best results.
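As a concrete illustration of the second ("lifting") stage, here is a minimal PyTorch sketch of a residual MLP that maps detected 2D joints to a 3D pose, in the spirit of the two-stage methods cited above; the layer sizes and joint count are assumptions. As noted, the mapping is not bijective, so such a single-frame lifter can regress only one plausible 3D pose per 2D input.

```python
import torch.nn as nn

class Lifter(nn.Module):
    """Lift detected 2D joints (x, y) to 3D joints (x, y, z)."""
    def __init__(self, n_joints=17, hidden=1024):
        super().__init__()
        self.inp = nn.Linear(n_joints * 2, hidden)
        self.block = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Dropout(0.5),
        )
        self.out = nn.Linear(hidden, n_joints * 3)

    def forward(self, joints_2d):      # (B, n_joints * 2)
        h = self.inp(joints_2d)
        h = h + self.block(h)          # residual connection
        return self.out(h)             # (B, n_joints * 3)
```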
STNet: Scale Tree Network with Multi-level Auxiliator for Crowd Counting
Wang, Mingjie, Cai, Hao, Han, Xianfeng, Zhou, Jun, Gong, Minglun
Crowd counting remains a challenging task because drastic scale variation, density inconsistency, and complex backgrounds can seriously degrade counting accuracy. To combat this accuracy degradation, we propose a novel and powerful network called Scale Tree Network (STNet) for accurate crowd counting. STNet consists of two key components: a Scale-Tree Diversity Enhancer and a Semi-supervised Multi-level Auxiliator. Specifically, the Diversity Enhancer is designed to enrich scale diversity, alleviating the limitations of existing methods caused by an insufficient range of scales. A novel tree structure is adopted to hierarchically parse coarse-to-fine crowd regions. Furthermore, a simple yet effective Multi-level Auxiliator is presented to aid in exploiting generalisable shared characteristics at multiple levels, allowing more accurate pixel-wise background recognition. The overall STNet is trained in an end-to-end manner, without the need to manually tune loss weights between the main and auxiliary tasks. Extensive experiments on four challenging crowd datasets demonstrate the superiority of the proposed method.
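The abstract states that no manual tuning of loss weights between the main counting task and the multi-level auxiliary tasks is needed. One standard way to achieve this is homoscedastic-uncertainty weighting (Kendall et al., 2018), sketched below in PyTorch; this particular scheme is an assumption for illustration, not necessarily the mechanism STNet uses.

```python
import torch
import torch.nn as nn

class AutoWeightedLoss(nn.Module):
    """Balance a main loss and auxiliary losses without hand-tuned weights."""
    def __init__(self, n_tasks):
        super().__init__()
        # One learnable log-variance per task, optimized jointly with
        # the network, so no weight is tuned by hand.
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        # losses: list of per-task scalar losses (main + auxiliaries)
        total = 0.0
        for loss, log_var in zip(losses, self.log_vars):
            total = total + torch.exp(-log_var) * loss + log_var
        return total
```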
Multi-scale Convolution Aggregation and Stochastic Feature Reuse for DenseNets
Wang, Mingjie, Zhou, Jun, Mao, Wendong, Gong, Minglun
Recently, Convolutional Neural Networks (CNNs) have achieved great success in numerous vision tasks. In particular, DenseNets have demonstrated that feature reuse via dense skip connections can effectively alleviate the difficulty of training very deep networks, and that reusing features generated by the initial layers in all subsequent layers has a strong impact on performance. To feed even richer information into the network, a novel adaptive Multi-scale Convolution Aggregation module is presented in this paper. Composed of layers for multi-scale convolutions, trainable cross-scale aggregation, maxout, and concatenation, this module is highly non-linear and can boost the accuracy of DenseNet while using far fewer parameters. In addition, due to its high model complexity, a network with extremely dense feature reuse is prone to overfitting. To address this problem, a regularization method named Stochastic Feature Reuse is also presented. By randomly dropping a set of the feature maps to be reused for each mini-batch during the training phase, this regularization method reduces training costs and prevents co-adaptation. Experimental results on the CIFAR-10, CIFAR-100, and SVHN benchmarks demonstrate the effectiveness of the proposed methods.
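Below is a minimal PyTorch sketch of the Stochastic Feature Reuse idea as described: within a dense block, a random subset of the reused (concatenated) feature maps is dropped for each mini-batch during training. Dropping whole channels with a single per-batch mask, and the keep probability, are illustrative assumptions rather than the paper's exact scheme.

```python
import torch

def stochastic_feature_reuse(features, keep_prob=0.8, training=True):
    """features: (B, C, H, W) concatenation of reused feature maps.

    Randomly zeroes whole channels for the current mini-batch, so
    downstream layers cannot co-adapt to any fixed subset of reused
    features. At inference time, all features are kept.
    """
    if not training:
        return features
    _, C, _, _ = features.shape
    # One mask shared across the mini-batch: a set of maps is dropped
    # "for each mini-batch", as the abstract describes.
    mask = (torch.rand(1, C, 1, 1, device=features.device) < keep_prob).float()
    # Rescale so the expected activation magnitude is unchanged.
    return features * mask / keep_prob
```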