"What exactly is computer vision then? Computer vision is a research field working to equip computers with the ability to process and understand visual data, as sighted humans can. Human brains process the gigabytes of data passing through our eyes every second and translate that data into sight - that is, into discrete objects and entities we can recognise or understand. Similarly, computer vision aims to give computers the ability to understand what they are seeing, and act intelligently on that knowledge."
– Computer vision: Cheat Sheet. ZDNet.com (December 6, 2011), by Natasha Lomas.
As we make tremendous advances in machine learning and artificial intelligence technosciences, there is a renewed understanding in the AI community that we must ensure that humans being are at the center of our deliberations so that we don't end in technology-induced dystopias. As strongly argued by Green in his book Smart Enough City, the incorporation of technology in city environs does not automatically translate into prosperity, wellbeing, urban livability, or social justice. There is a great need to deliberate on the future of the cities worth living and designing. There are philosophical and ethical questions involved along with various challenges that relate to the security, safety, and interpretability of AI algorithms that will form the technological bedrock of future cities. Several research institutes on human centered AI have been established at top international universities. Globally there are calls for technology to be made more humane and human-compatible. For example, Stuart Russell has a book called Human Compatible AI. The Center for Humane Technology advocates for regulators and technology companies to avoid business models and product features that contribute to social problems such as extremism, polarization, misinformation, and Internet addiction. In this paper, we analyze and explore key challenges including security, robustness, interpretability, and ethical challenges to a successful deployment of AI or ML in human-centric applications, with a particular emphasis on the convergence of these challenges. We provide a detailed review of existing literature on these key challenges and analyze how one of these challenges may lead to others or help in solving other challenges. The paper also advises on the current limitations, pitfalls, and future directions of research in these domains, and how it can fill the current gaps and lead to better solutions.
Reliably quantifying the confidence of deep neural classifiers is a challenging yet fundamental requirement for deploying such models in safety-critical applications. In this paper, we introduce a novel target criterion for model confidence, namely the true class probability (TCP). We show that TCP offers better properties for confidence estimation than standard maximum class probability (MCP). Since the true class is by essence unknown at test time, we propose to learn TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context. We evaluate our approach on the task of failure prediction and of self-training with pseudo-labels for domain adaptation, which both necessitate effective confidence estimates. Extensive experiments are conducted for validating the relevance of the proposed approach in each task. We study various network architectures and experiment with small and large datasets for image classification and semantic segmentation. In every tested benchmark, our approach outperforms strong baselines.
Image captioning is a challenging computer vision task, which aims to generate a natural language description of an image. Most recent researches follow the encoder-decoder framework which depends heavily on the previous generated words for the current prediction. Such methods can not effectively take advantage of the future predicted information to learn complete semantics. In this paper, we propose Context-Aware Auxiliary Guidance (CAAG) mechanism that can guide the captioning model to perceive global contexts. Upon the captioning model, CAAG performs semantic attention that selectively concentrates on useful information of the global predictions to reproduce the current generation. To validate the adaptability of the method, we apply CAAG to three popular captioners and our proposal achieves competitive performance on the challenging Microsoft COCO image captioning benchmark, e.g. 132.2 CIDEr-D score on Karpathy split and 130.7 CIDEr-D (c40) score on official online evaluation server.
In this era of active development of autonomous vehicles, it becomes crucial to provide driving systems with the capacity to explain their decisions. In this work, we focus on generating high-level driving explanations as the vehicle drives. We present BEEF, for BEhavior Explanation with Fusion, a deep architecture which explains the behavior of a trajectory prediction model. Supervised by annotations of human driving decisions justifications, BEEF learns to fuse features from multiple levels. Leveraging recent advances in the multi-modal fusion literature, BEEF is carefully designed to model the correlations between high-level decisions features and mid-level perceptual features. The flexibility and efficiency of our approach are validated with extensive experiments on the HDD and BDD-X datasets.
Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at https://github.com/fregu856/ebms_3dod.
This paper proposes a multi-sensor based approach to detect, track, and localize a quadcopter unmanned aerial vehicle (UAV). Specifically, a pipeline is developed to process monocular RGB and thermal video (captured from a fixed platform) to detect and track the UAV in our FoV. Subsequently, a 2D planar lidar is used to allow conversion of pixel data to actual distance measurements, and thereby enable localization of the UAV in global coordinates. The monocular data is processed through a deep learning-based object detection method that computes an initial bounding box for the UAV. The thermal data is processed through a thresholding and Kalman filter approach to detect and track the bounding box. Training and testing data are prepared by combining a set of original experiments conducted in a motion capture environment and publicly available UAV image data. The new pipeline compares favorably to existing methods and demonstrates promising tracking and localization capacity of sample experiments.
In the last years, AI safety gained international recognition in the light of heterogeneous safety-critical and ethical issues that risk overshadowing the broad beneficial impacts of AI. In this context, the implementation of AI observatory endeavors represents one key research direction. This paper motivates the need for an inherently transdisciplinary AI observatory approach integrating diverse retrospective and counterfactual views. We delineate aims and limitations while providing hands-on-advice utilizing concrete practical examples. Distinguishing between unintentionally and intentionally triggered AI risks with diverse socio-psycho-technological impacts, we exemplify a retrospective descriptive analysis followed by a retrospective counterfactual risk analysis. Building on these AI observatory tools, we present near-term transdisciplinary guidelines for AI safety. As further contribution, we discuss differentiated and tailored long-term directions through the lens of two disparate modern AI safety paradigms. For simplicity, we refer to these two different paradigms with the terms artificial stupidity (AS) and eternal creativity (EC) respectively. While both AS and EC acknowledge the need for a hybrid cognitive-affective approach to AI safety and overlap with regard to many short-term considerations, they differ fundamentally in the nature of multiple envisaged long-term solution patterns. By compiling relevant underlying contradistinctions, we aim to provide future-oriented incentives for constructive dialectics in practical and theoretical AI safety research.
This paper presents a novel alternative to Greedy Non-Maxima Suppression (NMS) in the task of bounding box selection and suppression in object detection. It proposes Confluence, an algorithm which does not rely solely on individual confidence scores to select optimal bounding boxes, nor does it rely on Intersection Over Union (IoU) to remove false positives. Using Manhattan Distance, it selects the bounding box which is closest to every other bounding box within the cluster and removes highly confluent neighboring boxes. Thus, Confluence represents a paradigm shift in bounding box selection and suppression as it is based on fundamentally different theoretical principles to Greedy NMS and its variants. Confluence is experimentally validated on RetinaNet, YOLOv3 and Mask-RCNN, using both the MS COCO and PASCAL VOC 2007 datasets. Confluence outperforms Greedy NMS in both mAP and recall on both datasets, using the challenging 0.50:0.95 mAP evaluation metric. On each detector and dataset, mAP was improved by 0.3-0.7% while recall was improved by 1.4-2.5%. A theoretical comparison of Greedy NMS and the Confluence Algorithm is provided, and quantitative results are supported by extensive qualitative results analysis. Furthermore, sensitivity analysis experiments across mAP thresholds support the conclusion that Confluence is more robust than NMS.
Autonomous navigation consists in an agent being able to navigate without human intervention or supervision, it affects both high level planning and low level control. Navigation is at the crossroad of multiple disciplines, it combines notions of computer vision, robotics and control. This work aimed at creating, in a simulation, a navigation pipeline whose transfer to the real world could be done with as few efforts as possible. Given the limited time and the wide range of problematic to be tackled, absolute navigation performances while important was not the main objective. The emphasis was rather put on studying the sim2real gap which is one the major bottlenecks of modern robotics and autonomous navigation. To design the navigation pipeline four main challenges arise; environment, localization, navigation and planning. The iGibson simulator is picked for its photo-realistic textures and physics engine. A topological approach to tackle space representation was picked over metric approaches because they generalize better to new environments and are less sensitive to change of conditions. The navigation pipeline is decomposed as a localization module, a planning module and a local navigation module. These modules utilize three different networks, an image representation extractor, a passage detector and a local policy. The laters are trained on specifically tailored tasks with some associated datasets created for those specific tasks. Localization is the ability for the agent to localize itself against a specific space representation. It must be reliable, repeatable and robust to a wide variety of transformations. Localization is tackled as an image retrieval task using a deep neural network trained on an auxiliary task as a feature descriptor extractor. The local policy is trained with behavioral cloning from expert trajectories gathered with ROS navigation stack.
Abstract--Edge computing is promising to become one of the next hottest topics in artificial intelligence because it benefits various evolving domains such as real-time unmanned aerial systems, industrial applications, and the demand for privacy protection. This paper reviews recent advances on binary neural network (BNN) and 1-bit CNN technologies that are well suitable for front-end, edge-based computing. We introduce and summarize existing work and classify them based on gradient approximation, quantization, architecture, loss functions, optimization method, and binary neural architecture search. We also introduce applications in the areas of computer vision and speech recognition and discuss future applications for edge computing. ITH the rapid development of information technology, cloud computing with centralized data processing cannot the performance of binary neural networks. To better review meet the needs of applications that require the processing these methods, we six aspects including gradient approximation, of massive amounts of data, nor can they be effectively used quantization, structural design, loss design, optimization, when privacy requires the data to remain at the source. Finally, we will also edge computing has become an alternative to handle the data review object detection, object tracking, and audio analysis from front-end or embedded devices.