High-resolution satellite imagery has proven useful for a broad range of tasks, including measurement of global human population, local economic livelihoods, and biodiversity, among many others. Unfortunately, high-resolution imagery is both infrequently collected and expensive to purchase, making it hard to efficiently and effectively scale these downstream tasks over both time and space. We propose a new conditional pixel synthesis model that uses abundant, low-cost, low-resolution imagery to generate accurate high-resolution imagery at locations and times in which it is unavailable. We show that our model attains photo-realistic sample quality and outperforms competing baselines on a key downstream task -- object counting -- particularly in geographic locations where conditions on the ground are changing rapidly.
Neural shape representations have recently been shown to be effective in shape analysis and reconstruction tasks. Existing neural network methods require point coordinates and corresponding normal vectors to learn the implicit level sets of the shape. Normal vectors are often not provided as raw data; approximation and reorientation are therefore required as pre-processing stages, both of which can introduce noise. In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orient gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal shape representation networks that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and show state-of-the-art performance compared to other unoriented methods and on-par performance compared to oriented methods.
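As a rough illustration of what a soft divergence constraint could look like, the following sketch writes a training objective for an implicit function $\Phi$ fitted to unoriented surface points. The abstract states only that a soft constraint on the divergence of the distance function is used; the manifold and Eikonal terms, the absolute-value penalties, and the weights $\lambda_m, \lambda_e, \lambda_d$ are assumptions borrowed from common practice in implicit surface learning, not the authors' exact formulation.

```latex
% Hypothetical objective (all weights and norms are assumptions).
% \mathcal{X}: input surface points, \Omega: sampled region around the shape.
\mathcal{L}(\Phi)
  = \lambda_m \, \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \lvert \Phi(x) \rvert
  + \lambda_e \, \mathbb{E}_{x \sim \Omega} \bigl\lvert \lVert \nabla \Phi(x) \rVert_2 - 1 \bigr\rvert
  + \lambda_d \, \mathbb{E}_{x \sim \Omega} \bigl\lvert \nabla \cdot \nabla \Phi(x) \bigr\rvert
```

The last term penalizes the divergence of the gradient field (the Laplacian of $\Phi$), which is the kind of quantity that favours smooth solutions; no normal vectors appear anywhere in the objective.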
A table arranging data in rows and columns is a very effective data structure, which has been widely used in business and scientific research. Considering large-scale tabular data in online and offline documents, automatic table recognition has attracted increasing attention from the document analysis community. Though humans can easily understand the structure of tables, it remains a challenge for machines, especially due to the variety of table layouts and styles. Existing methods usually model a table as either a markup sequence or an adjacency matrix between different table cells, failing to address the importance of the logical location of table cells, e.g., that a cell is located in the first row and the second column of the table. In this paper, we reformulate the problem of table structure recognition as table graph reconstruction, and propose an end-to-end trainable table graph reconstruction network (TGRNet) for table structure recognition. Specifically, the proposed method has two main branches, a cell detection branch and a cell logical location branch, to jointly predict the spatial location and the logical location of different cells. Experimental results on three popular table recognition datasets and a new dataset with table graph annotations (TableGraph-350K) demonstrate the effectiveness of the proposed TGRNet for table structure recognition. Code and annotations will be made publicly available.
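To make the distinction between spatial and logical location concrete, here is a minimal sketch of a cell record that carries both, together with helpers showing that row and column adjacency can be derived from logical locations. The `Cell` class, its field names, and the example table are hypothetical illustrations and are not taken from TGRNet or its annotation format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Cell:
    """Hypothetical table cell with a spatial and a logical location."""
    # Spatial location: bounding box (x1, y1, x2, y2) in pixels.
    box: Tuple[int, int, int, int]
    # Logical location: inclusive, 0-based row and column index ranges.
    row_start: int
    row_end: int
    col_start: int
    col_end: int


def same_row(a: Cell, b: Cell) -> bool:
    """True if the two cells share at least one logical row."""
    return a.row_start <= b.row_end and b.row_start <= a.row_end


def same_col(a: Cell, b: Cell) -> bool:
    """True if the two cells share at least one logical column."""
    return a.col_start <= b.col_end and b.col_start <= a.col_end


# A header spanning two columns, above two ordinary cells.
header = Cell((0, 0, 200, 20), row_start=0, row_end=0, col_start=0, col_end=1)
left = Cell((0, 20, 100, 40), row_start=1, row_end=1, col_start=0, col_end=0)
right = Cell((100, 20, 200, 40), row_start=1, row_end=1, col_start=1, col_end=1)

print(same_row(left, right))   # cells in logical row 1
print(same_col(header, left))  # header spans column 0
print(same_col(left, right))   # different columns
```

The reverse direction does not hold in general: a pairwise adjacency matrix alone does not say which row or column index a cell occupies, which is the gap the logical location is meant to fill.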
The costly process of obtaining semantic segmentation labels has driven research towards weakly supervised semantic segmentation (WSSS) methods, using only image-level, point, or box labels. The lack of dense scene representation requires methods to increase complexity to obtain additional semantic information about the scene, often done through multiple stages of training and refinement. Current state-of-the-art (SOTA) models leverage image-level labels to produce class activation maps (CAMs) which go through multiple stages of refinement before they are thresholded to make pseudo-masks for supervision. The multi-stage approach is computationally expensive, and dependency on image-level labels for CAM generation lacks generalizability to more complex scenes. In contrast, our method offers a single-stage approach, generalizable to arbitrary datasets, that is trainable from scratch, without any dependency on pre-trained backbones, classification, or separate refinement tasks. We utilize point annotations to generate reliable, on-the-fly pseudo-masks through refined and filtered features. While our method requires point annotations that are only slightly more expensive than image-level annotations, we demonstrate SOTA performance on benchmark datasets (PascalVOC 2012), and significantly outperform other SOTA WSSS methods on recent real-world datasets (CRAID, CityPersons, IAD).
A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the "missing human baseline" by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here: https://github.com/bethgelab/model-vs-human/
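The image-level consistency in finding (2.) asks whether two observers get the same images right and wrong, beyond what their accuracies alone would predict. The abstract does not name the metric, so the sketch below uses Cohen's kappa on per-image correctness as one standard agreement measure; treating it as the consistency measure here is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa between two observers' per-image correctness.

    1.0 means identical right/wrong patterns; 0.0 means the overlap is what
    two independent observers with these accuracies would show by chance.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(correct_a)
    # Observed fraction of images on which both are right or both are wrong.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected by chance, given only the two accuracies.
    p_a = sum(correct_a) / n
    p_b = sum(correct_b) / n
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if c_exp == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (c_obs - c_exp) / (1 - c_exp)


human = [True, False, True, False]
model_same_errors = [True, False, True, False]
model_other_errors = [True, True, False, False]

print(error_consistency(human, model_same_errors))
print(error_consistency(human, model_other_errors))
```

Both hypothetical models have the same 50% accuracy as the human observer, but only the first makes its errors on the same images, which is exactly the difference an accuracy comparison cannot see.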
Due to growing awareness of driving safety and the development of sophisticated technologies, advanced driving assistance systems (ADAS) have been installed in more and more vehicles, with higher accuracy and lower prices. The latest progress in this field calls for a review that sums up the conventional knowledge of ADAS, state-of-the-art research, and novel real-world applications. With such a review, newcomers to the field can acquire basic knowledge more easily, and other researchers may be inspired by potential directions for future development. This paper gives a general introduction to ADAS by analyzing its hardware support and computation algorithms. Different types of perception sensors are introduced in terms of their internal feature classifications, installation positions, supported ADAS functions, and pros and cons. Comparisons between different sensors are drawn and illustrated based on their inherent characteristics and their specific uses for each ADAS function. Current algorithms for ADAS functions are also collected and briefly presented, covering both traditional methods and novel ideas. Additionally, definitions of ADAS from different institutes are reviewed, and future approaches to ADAS in China are introduced in particular.
Large, deep networks can learn representations that generalize well. While smaller, more efficient networks lack the inductive biases to find these representations from training data alone, they may have the capacity to represent these solutions [e.g., 1, 16, 27, 39]. Influential work on knowledge distillation argues that Bucilă et al. "demonstrate convincingly that the knowledge acquired by a large ensemble of models [the teacher] can be transferred to a single small model [the student]". Indeed this quote encapsulates the conventional narrative of knowledge distillation: a student model learns a high-fidelity representation of a larger teacher, enabled by the teacher's soft labels. Conversely, in Figure 1 we show that with modern architectures knowledge distillation can lead to students with very different predictions from their teachers, even when the student has the capacity to perfectly match the teacher.
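For readers unfamiliar with soft labels, the following minimal sketch shows a common form of the distillation loss (KL divergence between temperature-softened teacher and student distributions) next to a simple top-1 agreement score. The temperature value, the function names, and the use of top-1 agreement as a stand-in for fidelity are assumptions made for illustration; the text above does not specify them.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      temperature: float = 4.0) -> float:
    """KL(teacher || student) between temperature-softened distributions."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher soft labels
    log_q = log_softmax(student_logits, temperature)  # student predictions
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def top1_agreement(student_batch: Sequence[Sequence[float]],
                   teacher_batch: Sequence[Sequence[float]]) -> float:
    """Fraction of inputs on which student and teacher pick the same class."""
    def argmax(v: Sequence[float]) -> int:
        return max(range(len(v)), key=lambda i: v[i])
    pairs = list(zip(student_batch, teacher_batch))
    return sum(argmax(s) == argmax(t) for s, t in pairs) / len(pairs)


teacher = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5]]
student = [[1.5, 1.2, 0.3], [2.5, 0.4, 0.1]]

print(distillation_loss(student[0], teacher[0]))
print(top1_agreement(student, teacher))
```

The loss is zero only when the softened distributions coincide, so a student that ends training with a small but nonzero loss can still disagree with its teacher on individual inputs, as the second example pair does.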
Motivated by the vast success of deep convolutional networks, there is a great interest in generalizing convolutions to non-Euclidean manifolds. A major complication in comparison to flat spaces is that it is unclear in which alignment a convolution kernel should be applied on a manifold. The underlying reason for this ambiguity is that general manifolds do not come with a canonical choice of reference frames (gauge). Kernels and features therefore have to be expressed relative to arbitrary coordinates. We argue that the particular choice of coordinatization should not affect a network's inference -- it should be coordinate independent. A simultaneous demand for coordinate independence and weight sharing is shown to result in a requirement on the network to be equivariant under local gauge transformations (changes of local reference frames). The ambiguity of reference frames depends thereby on the G-structure of the manifold, such that the necessary level of gauge equivariance is prescribed by the corresponding structure group G. Coordinate independent convolutions are proven to be equivariant w.r.t. those isometries that are symmetries of the G-structure. The resulting theory is formulated in a coordinate free fashion in terms of fiber bundles. To exemplify the design of coordinate independent convolutions, we implement a convolutional network on the Möbius strip. The generality of our differential geometric formulation of convolutional networks is demonstrated by an extensive literature review which explains a large number of Euclidean CNNs, spherical CNNs and CNNs on general surfaces as specific instances of coordinate independent convolutions.
This work attempts to provide a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation. We argue that for high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding rate difference between the whole dataset and the average of all the subsets. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep network, named ReduNet, which shares common characteristics of modern deep networks. The deep layered architectures, linear and nonlinear operators, and even parameters of the network are all explicitly constructed layer-by-layer via forward propagation, although they are amenable to fine-tuning via back propagation. All components of the so-obtained "white-box" network have precise optimization, statistical, and geometric interpretations. Moreover, all linear operators of the so-derived network naturally become multi-channel convolutions when we enforce classification to be rigorously shift-invariant. The derivation in the invariant setting suggests a trade-off between sparsity and invariance, and also indicates that such a deep convolution network is significantly more efficient to construct and learn in the spectral domain. Our preliminary simulations and experiments clearly verify the effectiveness of both the rate reduction objective and the associated ReduNet. All code and data are available at https://github.com/Ma-Lab-Berkeley. Keywords: rate reduction, linear discriminative representation, white-box deep network, multi-channel convolution, sparsity and invariance trade-off. "What I cannot create, I do not understand."
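The "coding rate difference between the whole dataset and the average of all the subsets" can be written out explicitly. The abstract does not give the formula; the block below is a sketch in the form commonly used in the rate reduction literature, and the symbols ($Z \in \mathbb{R}^{d \times n}$ for the $n$ features of dimension $d$, diagonal membership matrices $\Pi_j$ for the $k$ classes, and distortion $\epsilon$) are assumptions introduced here for illustration.

```latex
% Coding rate of the whole feature set Z (d x n), up to distortion \epsilon:
R(Z, \epsilon) = \frac{1}{2} \log\det\!\Bigl( I + \frac{d}{n \epsilon^{2}} Z Z^{\top} \Bigr)

% Average coding rate of the k class subsets, with diagonal memberships \Pi_j:
R_c(Z, \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
  \log\det\!\Bigl( I + \frac{d}{\operatorname{tr}(\Pi_j)\, \epsilon^{2}} Z \Pi_j Z^{\top} \Bigr)

% Rate reduction objective to be maximized:
\Delta R(Z, \Pi, \epsilon) = R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi)
```

Maximizing $\Delta R$ expands the volume spanned by all features while compressing each class, and a gradient ascent step on this objective is the operation that the abstract describes as giving rise to one layer of the network.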
Commonly used metrics for evaluation of object detection systems (precision, recall, mAP) do not give complete information about their suitability for use in safety critical tasks, like obstacle detection for collision avoidance in Autonomous Vehicles (AV). Ranks are assigned based on an objective cyber-physical model for the risk of collision. Recall is measured for each rank. A front view scene from the BDD100K dataset with 4 labeled vehicles. Intuitively, the closer vehicles are more important to detect than those farther away.
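Measuring recall separately for each rank can be sketched as follows. The rank of each ground-truth object is taken as a given input here; the cyber-physical risk model that assigns ranks is not reproduced, and the function name, the input format, and the example scene are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_per_rank(ground_truth: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Recall of ground-truth objects, grouped by their assigned risk rank.

    ground_truth: (rank, detected) pairs, one per labeled object, where
    `detected` says whether the detector matched that object.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for rank, detected in ground_truth:
        totals[rank] += 1
        if detected:
            hits[rank] += 1
    return {rank: hits[rank] / totals[rank] for rank in sorted(totals)}


# Hypothetical scene with 4 labeled vehicles; rank 1 is assumed most critical.
scene = [(1, True), (1, False), (2, True), (3, False)]

print(recall_per_rank(scene))
```

In this made-up scene the overall recall is 0.5, yet that single number hides the fact that one of the two rank-1 vehicles was missed, which is the kind of information a per-rank breakdown is meant to expose.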