Artificial Intelligence (AI) is not just a buzzword, but a crucial part of the technology landscape. AI is changing every industry and business function, which results in increased interest in its applications, subdomains and related fields. This makes AI companies the top leaders driving the technology swift. AI helps us to optimise and automate crucial business processes, gather essential data and transform the world, one step at a time. From Google and Amazon to Apple and Microsoft, every major tech company is dedicating resources to breakthroughs in artificial intelligence. As big enterprises are busy acquiring or merging with other emerging inventions, small AI companies are also working hard to develop their own intelligent technology and services. By leveraging artificial intelligence, organizations get an innovative edge in the digital age. AI consults are also working to provide companies with expertise that can help them grow. In this digital era, AI is also a significant place for investment. AI companies are constantly developing the latest products to provide the simplest solutions. Henceforth, Analytics Insight brings you the list of top 100 AI companies that are leading the technology drive towards a better tomorrow. AEye develops advanced vision hardware, software, and algorithms that act as the eyes and visual cortex of autonomous vehicles. AEye is an artificial perception pioneer and creator of iDAR, a new form of intelligent data collection that acts as the eyes and visual cortex of autonomous vehicles. Since its demonstration of its solid state LiDAR scanner in 2013, AEye has pioneered breakthroughs in intelligent sensing. Their mission was to acquire the most information with the fewest ones and zeros. This would allow AEye to drive the automotive industry into the next realm of autonomy. Algorithmia invented the AI Layer.
With the increasing reliance of self-driving and similar robotic systems on robust 3D vision, the processing of LiDAR scans with deep convolutional neural networks has become a trend in academia and industry alike. Prior attempts on the challenging Semantic Scene Completion task - which entails the inference of dense 3D structure and associated semantic labels from "sparse" representations - have been, to a degree, successful in small indoor scenes when provided with dense point clouds or dense depth maps often fused with semantic segmentation maps from RGB images. However, the performance of these systems drop drastically when applied to large outdoor scenes characterized by dynamic and exponentially sparser conditions. Likewise, processing of the entire sparse volume becomes infeasible due to memory limitations and workarounds introduce computational inefficiency as practitioners are forced to divide the overall volume into multiple equal segments and infer on each individually, rendering real-time performance impossible. In this work, we formulate a method that subsumes the sparsity of large-scale environments and present S3CNet, a sparse convolution based neural network that predicts the semantically completed scene from a single, unified LiDAR point cloud. We show that our proposed method outperforms all counterparts on the 3D task, achieving state-of-the art results on the SemanticKITTI benchmark. Furthermore, we propose a 2D variant of S3CNet with a multi-view fusion strategy to complement our 3D network, providing robustness to occlusions and extreme sparsity in distant regions. We conduct experiments for the 2D semantic scene completion task and compare the results of our sparse 2D network against several leading LiDAR segmentation models adapted for bird's eye view segmentation on two open-source datasets.
As we make tremendous advances in machine learning and artificial intelligence technosciences, there is a renewed understanding in the AI community that we must ensure that humans being are at the center of our deliberations so that we don't end in technology-induced dystopias. As strongly argued by Green in his book Smart Enough City, the incorporation of technology in city environs does not automatically translate into prosperity, wellbeing, urban livability, or social justice. There is a great need to deliberate on the future of the cities worth living and designing. There are philosophical and ethical questions involved along with various challenges that relate to the security, safety, and interpretability of AI algorithms that will form the technological bedrock of future cities. Several research institutes on human centered AI have been established at top international universities. Globally there are calls for technology to be made more humane and human-compatible. For example, Stuart Russell has a book called Human Compatible AI. The Center for Humane Technology advocates for regulators and technology companies to avoid business models and product features that contribute to social problems such as extremism, polarization, misinformation, and Internet addiction. In this paper, we analyze and explore key challenges including security, robustness, interpretability, and ethical challenges to a successful deployment of AI or ML in human-centric applications, with a particular emphasis on the convergence of these challenges. We provide a detailed review of existing literature on these key challenges and analyze how one of these challenges may lead to others or help in solving other challenges. The paper also advises on the current limitations, pitfalls, and future directions of research in these domains, and how it can fill the current gaps and lead to better solutions.
In this era of active development of autonomous vehicles, it becomes crucial to provide driving systems with the capacity to explain their decisions. In this work, we focus on generating high-level driving explanations as the vehicle drives. We present BEEF, for BEhavior Explanation with Fusion, a deep architecture which explains the behavior of a trajectory prediction model. Supervised by annotations of human driving decisions justifications, BEEF learns to fuse features from multiple levels. Leveraging recent advances in the multi-modal fusion literature, BEEF is carefully designed to model the correlations between high-level decisions features and mid-level perceptual features. The flexibility and efficiency of our approach are validated with extensive experiments on the HDD and BDD-X datasets.
Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at https://github.com/fregu856/ebms_3dod.
This paper proposes a multi-sensor based approach to detect, track, and localize a quadcopter unmanned aerial vehicle (UAV). Specifically, a pipeline is developed to process monocular RGB and thermal video (captured from a fixed platform) to detect and track the UAV in our FoV. Subsequently, a 2D planar lidar is used to allow conversion of pixel data to actual distance measurements, and thereby enable localization of the UAV in global coordinates. The monocular data is processed through a deep learning-based object detection method that computes an initial bounding box for the UAV. The thermal data is processed through a thresholding and Kalman filter approach to detect and track the bounding box. Training and testing data are prepared by combining a set of original experiments conducted in a motion capture environment and publicly available UAV image data. The new pipeline compares favorably to existing methods and demonstrates promising tracking and localization capacity of sample experiments.
We present a classification based approach for the next best view selection and show how we can plausibly obtain a supervisory signal for this task. The proposed approach is end-to-end trainable and aims to get the best possible 3D reconstruction quality with a pair of passively acquired 2D views. The proposed model consists of two stages: a classifier and a reconstructor network trained jointly via the indirect 3D supervision from ground truth voxels. While testing, the proposed method assumes no prior knowledge of the underlying 3D shape for selecting the next best view. We demonstrate the proposed method's effectiveness via detailed experiments on synthetic and real images and show how it provides improved reconstruction quality than the existing state of the art 3D reconstruction and the next best view prediction techniques.
Predicting future states or actions of a given system remains a fundamental, yet unsolved challenge of intelligence, especially in the scope of complex and non-deterministic scenarios, such as modeling behavior of humans. Existing approaches provide results under strong assumptions concerning unimodality of future states, or, at best, assuming specific probability distributions that often poorly fit to real-life conditions. In this work we introduce a robust and flexible probabilistic framework that allows to model future predictions with virtually no constrains regarding the modality or underlying probability distribution. To achieve this goal, we leverage a hypernetwork architecture and train a continuous normalizing flow model. The resulting method dubbed RegFlow achieves state-of-the-art results on several benchmark datasets, outperforming competing approaches by a significant margin.
Autonomous navigation consists in an agent being able to navigate without human intervention or supervision, it affects both high level planning and low level control. Navigation is at the crossroad of multiple disciplines, it combines notions of computer vision, robotics and control. This work aimed at creating, in a simulation, a navigation pipeline whose transfer to the real world could be done with as few efforts as possible. Given the limited time and the wide range of problematic to be tackled, absolute navigation performances while important was not the main objective. The emphasis was rather put on studying the sim2real gap which is one the major bottlenecks of modern robotics and autonomous navigation. To design the navigation pipeline four main challenges arise; environment, localization, navigation and planning. The iGibson simulator is picked for its photo-realistic textures and physics engine. A topological approach to tackle space representation was picked over metric approaches because they generalize better to new environments and are less sensitive to change of conditions. The navigation pipeline is decomposed as a localization module, a planning module and a local navigation module. These modules utilize three different networks, an image representation extractor, a passage detector and a local policy. The laters are trained on specifically tailored tasks with some associated datasets created for those specific tasks. Localization is the ability for the agent to localize itself against a specific space representation. It must be reliable, repeatable and robust to a wide variety of transformations. Localization is tackled as an image retrieval task using a deep neural network trained on an auxiliary task as a feature descriptor extractor. The local policy is trained with behavioral cloning from expert trajectories gathered with ROS navigation stack.
Recent works in self-supervised video prediction have mainly focused on passive forecasting and low-level action-conditional prediction, which sidesteps the problem of semantic learning. We introduce the task of semantic action-conditional video prediction, which can be regarded as an inverse problem of action recognition. The challenge of this new task primarily lies in how to effectively inform the model of semantic action information. To bridge vision and language, we utilize the idea of capsule and propose a novel video prediction model Action Concept Grounding Network (ACGN). Our method is evaluated on two newly designed synthetic datasets, CLEVR-Building-Blocks and Sapien-Kitchen, and experiments show that given different action labels, our ACGN can correctly condition on instructions and generate corresponding future frames without need of bounding boxes. We further demonstrate our trained model can make out-of-distribution predictions for concurrent actions, be quickly adapted to new object categories and exploit its learnt features for object detection. Additional visualizations can be found at https://iclr-acgn.github.io/ACGN/.