SNIPER: Efficient Multi-Scale Training
Instead of processing every pixel in an image pyramid, SNIPER processes context regions around ground-truth instances (referred to as chips) at the appropriate scale. For background sampling, these context regions are generated from proposals extracted by a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adapts to scene complexity. SNIPER processes only 30% more pixels than the commonly used single-scale training at 800x1333 pixels on the COCO dataset, yet it also observes samples from extreme resolutions of the image pyramid, such as 1400x2000 pixels. Because SNIPER operates on resampled low-resolution chips (512x512 pixels), it can use a batch size as large as 20 on a single GPU even with a ResNet-101 backbone, and can therefore benefit from batch normalization during training without synchronizing batch-normalization statistics across GPUs. SNIPER brings the training of instance-level recognition tasks like object detection closer to the protocol for image classification, and suggests that the commonly accepted guideline that training on high-resolution images is important for instance-level visual recognition might not be correct. Our implementation, based on Faster-RCNN with a ResNet-101 backbone, obtains an mAP of 47.6% on the COCO dataset for bounding-box detection and can process 5 images per second during inference on a single GPU.
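The chip-sampling idea above can be sketched in a few lines. The helper below is a hypothetical illustration, not the authors' code: it places one fixed-size 512x512 chip roughly centered on each ground-truth box at the current pyramid scale, clipped so it stays inside the image.

```python
import numpy as np

def chips_for_boxes(boxes, img_w, img_h, chip=512):
    """Toy sketch of SNIPER-style positive chip generation (illustrative,
    not the authors' implementation): one fixed-size chip per ground-truth
    box, centered on the box and clipped to the image bounds.

    boxes: iterable of (x1, y1, x2, y2) in pixels at the current scale.
    Returns an array of chip windows (x1, y1, x2, y2), each chip x chip.
    """
    out = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        # Center the chip on the box, shifting it so it stays inside the image.
        ox = int(np.clip(cx - chip / 2, 0, max(img_w - chip, 0)))
        oy = int(np.clip(cy - chip / 2, 0, max(img_h - chip, 0)))
        out.append((ox, oy, ox + chip, oy + chip))
    return np.array(out)

# Two boxes in a 1400x2000 image yield two 512x512 chips inside the image.
chips = chips_for_boxes([(100, 120, 300, 400), (1200, 1700, 1380, 1950)], 1400, 2000)
```

In the actual method, negative chips are additionally sampled around region proposals, and each chip is resampled to a common low resolution so that a large batch fits on one GPU.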
MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset for Versatile Wireless Sensing
Jianfei Yang, He Huang, Yunjiao Zhou
MATLAB, as shown in Table 2. To enhance the sensing quality, we aggregate five adjacent frames into a new frame. In the WiFi CSI data, some sequences contain "-inf" values. To facilitate users, we have embedded the corresponding processing code into our dataset tool: when a user loads the WiFi CSI data, these values are handled by linear interpolation. As presented in Section 4.3, we provide temporal annotations; each sequence is annotated by at least 5 human annotators.
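The "-inf" handling described above amounts to linearly interpolating over the surrounding finite samples. A minimal sketch of that idea (an illustration, not the MM-Fi toolbox itself) for a 1-D CSI amplitude sequence:

```python
import numpy as np

def fill_neg_inf(csi):
    """Illustrative sketch (not the MM-Fi dataset tool): replace "-inf"
    entries in a 1-D CSI amplitude sequence by linear interpolation over
    the surrounding finite samples, as the loader is described to do."""
    csi = np.asarray(csi, dtype=float)
    bad = np.isneginf(csi)
    idx = np.arange(len(csi))
    # np.interp interpolates linearly (clamping at the ends) from finite samples.
    csi[bad] = np.interp(idx[bad], idx[~bad], csi[~bad])
    return csi

print(fill_neg_inf([1.0, float("-inf"), 3.0]))  # -> [1. 2. 3.]
```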
$\max_{k \in [K]} h_{ik}(x) > 1 - B/K - \frac{1}{K} \min_{i \in [n]} \min_{l \in [K]} a_{il}(x)$.

Step 2. We assume that $h_{jr}(x)$ attains the largest value of $h_{ik}(x)$ for any $i \in [n]$, $k \in [K]$. Then $h_{jr}(x) > 1 - \min$
A.1 Implementation Details
Network Architecture: Inspired by [33], we utilize a pre-trained ResNet-50 [20] as the feature extractor for object recognition tasks (i.e., Office-31 [22], Office-Caltech [18], and Office-Home [46]). The overall framework is trained in an end-to-end manner via back-propagation. Stochastic gradient descent with momentum 0.9 is employed as the network optimizer. The initial learning rates for the feature extractor and the bottleneck layer are set to 10^{-3} and 10^{-2}, respectively, while the parameters of the classifier are frozen. The learning rate is exponentially decayed as training proceeds.
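The optimization recipe above (SGD with momentum 0.9 and an exponentially decayed learning rate) can be written out explicitly. The sketch below uses the momentum and learning-rate values from the text on a toy quadratic loss; the decay constant `gamma` is an assumption for illustration, not a value from the paper.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, mu=0.9):
    """One SGD-with-momentum update: v <- mu*v - lr*grad, w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
lr0, gamma = 1e-2, 0.99          # gamma is illustrative, not from the paper
for t in range(100):
    grad = 2 * w                 # gradient of the toy loss ||w||^2
    # Exponential decay: lr_t = lr_0 * gamma**t
    w, v = sgd_momentum_step(w, v, grad, lr0 * gamma**t)
```

In practice the same schedule is expressed through a framework optimizer (e.g. a PyTorch `SGD` with per-parameter-group learning rates and an exponential scheduler), with the classifier parameters simply excluded from the optimized groups to keep them frozen.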
Learning long-range spatial dependencies with horizontal gated recurrent units
Progress in deep learning has spawned great successes in many engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural network, are now approaching -- and sometimes even surpassing -- human accuracy on a variety of visual recognition tasks. Here, however, we show that these neural networks and their recent extensions struggle in recognition tasks where co-dependent visual features must be detected over long spatial ranges. We introduce a visual challenge, Pathfinder, and describe a novel recurrent neural network architecture called the horizontal gated recurrent unit (hGRU) to learn intrinsic horizontal connections -- both within and across feature columns. We demonstrate that a single hGRU layer matches or outperforms all tested feedforward hierarchical baselines, including state-of-the-art architectures with orders of magnitude more parameters.
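The core mechanism (a gated recurrent update over a spatial map, where lateral convolutions let activity propagate horizontally across positions over timesteps) can be sketched in a heavily simplified, single-channel form. This is an illustration in the spirit of the hGRU, not the published architecture: the kernels `w_gate` and `w_lat` are arbitrary placeholders.

```python
import numpy as np

def conv3x3(x, k):
    """Same-padding 3x3 cross-correlation on a 2-D map, in plain numpy."""
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += k[di, dj] * p[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def hgru_like_step(x, h, w_gate, w_lat):
    """One simplified, single-channel recurrent step (illustrative only):
    a convolutional gate decides where the hidden state is overwritten
    by laterally propagated activity."""
    g = 1.0 / (1.0 + np.exp(-conv3x3(h, w_gate)))  # spatial gate in (0, 1)
    c = np.tanh(conv3x3(h, w_lat) + x)             # lateral ("horizontal") drive
    return (1.0 - g) * h + g * c                   # gated recurrent update

rng = np.random.default_rng(0)
x = np.zeros((8, 8)); x[4, 1] = 1.0                # a single active input pixel
h = np.zeros((8, 8))
w_gate, w_lat = rng.normal(0, 0.5, (3, 3)), np.full((3, 3), 0.3)
for _ in range(6):                                 # activity spreads laterally each step
    h = hgru_like_step(x, h, w_gate, w_lat)
```

After a few timesteps, activity originating at the single input pixel has spread several positions away through the lateral kernel, which is the property that lets a recurrent layer trace long paths that a shallow feedforward receptive field cannot cover.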
Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
Vision transformers (ViTs) are top-performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more "data hungry" than brains, with ViTs requiring more training data than brains to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled-rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self-supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained "through the eyes" of newborn chicks, the ViTs solved the same view-invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn chicks: both learned view-invariant object representations in impoverished visual environments. The flexible and generic attention-based learning mechanism in ViTs--combined with the embodied data streams available to newborn animals--appears sufficient to drive the development of animal-like object recognition.
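"Leveraging time as a teaching signal" is typically realized by treating temporally adjacent frames as positive pairs for a contrastive objective. The sketch below shows one common choice (an InfoNCE-style loss over frame embeddings), offered as an illustration of the idea rather than the paper's exact objective; the embeddings and temperature are toy values.

```python
import numpy as np

def temporal_infonce(z, tau=0.1):
    """Sketch of a time-contrastive objective (one common formulation, not
    necessarily the paper's exact loss): each frame embedding z[t] should be
    most similar to its temporal successor z[t+1] among all frames."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize embeddings
    sim = z @ z.T / tau                               # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # a frame is not its own positive
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positives are the temporally adjacent pairs t -> t+1.
    return -np.mean([logp[t, t + 1] for t in range(len(z) - 1)])

# A smooth trajectory (adjacent frames similar, as in egocentric video) gives a
# lower loss than the same embeddings in shuffled temporal order.
t = np.linspace(0, 1, 32, endpoint=False)[:, None]
smooth = np.hstack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
shuffled = smooth[np.random.default_rng(0).permutation(32)]
```

Minimizing such a loss pushes views of the same object seen moments apart toward a shared representation, which is how slowly changing egocentric input can drive view-invariant object recognition without labels.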