Vision


French prosecutors suspect Musk encouraged deepfakes row to inflate X value

The Japan Times

Paris - French prosecutors said Saturday they had alerted U.S. authorities to a suspicion that tech tycoon Elon Musk had encouraged controversy over sexualized deepfakes on X to artificially increase the value of his company. The social media network's Grok AI chatbot stirred outrage earlier this year by generating images of naked women and girls without their consent. The controversy sparked by sexually explicit deepfakes generated by Grok (X's AI) may have been deliberately generated in order to artificially boost the value of companies X and xAI, the Paris prosecutor's office said, confirming a report in Le Monde newspaper on Friday.


Essex police pause facial recognition camera use after study finds racial bias

The Guardian

Academics discover black people 'significantly more likely' to be identified when compared with other ethnic groups. Essex police have paused the use of live facial recognition (LFR) technology after a study found cameras were significantly more likely to target black people than people of other ethnicities. The move to suspend use of the AI-enabled systems was revealed by the Information Commissioner's Office (ICO), which regulates the use of the technology deployed so far by at least 13 police forces in London, south and north Wales, Leicestershire, Northamptonshire, Hampshire, Bedfordshire, Suffolk, Greater Manchester, West Yorkshire, Surrey and Sussex. The ICO said Essex police had paused LFR deployments "after identifying potential accuracy and bias risks" and warned other forces to have mitigations in place. LFR systems are either mounted to fixed locations or deployed in vans. In January, the home secretary, Shabana Mahmood, announced the number of LFR vans would increase five-fold, with 50 available to every police force in England and Wales. Essex commissioned University of Cambridge academics to conduct a study, which involved 188 actors walking past cameras being actively deployed from marked police vans in Chelmsford.


Beyond Grids: Learning Graph Representations for Visual Recognition

Neural Information Processing Systems

We propose learning graph representations from 2D feature maps for visual recognition. Our method draws inspiration from region-based recognition, and learns to transform a 2D image into a graph structure. The vertices of the graph define clusters of pixels (regions), and the edges measure the similarity between these clusters in a feature space. Our method further learns to propagate information across all vertices on the graph, and is able to project the learned graph representation back into 2D grids. Our graph representation facilitates reasoning beyond regular grids and can capture long-range dependencies among regions. We demonstrate that our model can be trained end-to-end, and is easily integrated into existing networks. Finally, we evaluate our method on three challenging recognition tasks: semantic segmentation, object detection and object instance segmentation. For all tasks, our method outperforms state-of-the-art methods.


Incorporating Side Information by Adaptive Convolution

Neural Information Processing Systems

Computer vision tasks often have side information available that is helpful to solve the task. For example, for crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in counting systems based on deep learning. In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolution filter weights adapt to the current scene context via the side information.
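The idea of filter weights that adapt to side information can be illustrated with a minimal sketch. It is not the paper's ACNN: the abstract does not say how the weights are generated, so a single linear map stands in for that step, and every name and number below (generate_kernel, WEIGHT_ROWS, the 1-D signal) is a hypothetical chosen for illustration.

```python
# Sketch of an "adaptive" 1-D convolution: the kernel is not a fixed
# parameter but is produced from side information by a small linear map.
# All names and values here are illustrative assumptions.

def generate_kernel(side_info, weight_rows, bias):
    """Map a side-information vector to kernel taps (one linear layer)."""
    return [
        sum(w * s for w, s in zip(row, side_info)) + b
        for row, b in zip(weight_rows, bias)
    ]

def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation of a 1-D signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def adaptive_conv1d(signal, side_info, weight_rows, bias):
    """Convolve with a kernel that depends on the side information."""
    kernel = generate_kernel(side_info, weight_rows, bias)
    return conv1d_valid(signal, kernel)

# Hypothetical filter-generating parameters: 3 kernel taps from 2 side inputs.
WEIGHT_ROWS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
BIAS = [0.0, 0.0, 0.0]
signal = [1.0, 2.0, 3.0, 4.0]

# The same signal is filtered differently under two side-information vectors.
low = adaptive_conv1d(signal, [1.0, 0.0], WEIGHT_ROWS, BIAS)   # kernel [1, 0, 1]
high = adaptive_conv1d(signal, [0.0, 1.0], WEIGHT_ROWS, BIAS)  # kernel [0, 1, 1]
```

In this toy setup the side-information vector plays the role of the camera angle and height mentioned above: changing it changes the kernel, and therefore the response, without changing the input signal.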


Graph Matching via Multiplicative Update Algorithm

Neural Information Processing Systems

As a fundamental problem in computer vision, the graph matching problem can usually be formulated as a Quadratic Programming (QP) problem with doubly stochastic and discrete (integer) constraints. Since it is NP-hard, approximate algorithms are required. In this paper, we present a new algorithm, called Multiplicative Update Graph Matching (MPGM), that develops a multiplicative update technique to solve the QP matching problem. MPGM has three main benefits: (1) theoretically, MPGM naturally solves the general QP problem with the doubly stochastic constraint, with guaranteed convergence and KKT optimality.
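For orientation, the QP formulation referred to above is commonly written as follows. The notation is an assumption, since the abstract gives none: X is the assignment matrix between two graphs of n nodes each, W is an n^2 x n^2 affinity matrix encoding node and edge similarities, and 1 is the all-ones vector.

```latex
\max_{X}\ \operatorname{vec}(X)^{\top} W \operatorname{vec}(X)
\quad \text{s.t.} \quad
X \mathbf{1} = \mathbf{1},\quad
X^{\top} \mathbf{1} = \mathbf{1},\quad
X \in \{0,1\}^{n \times n}
```

The row and column sum constraints, together with nonnegativity, make X doubly stochastic; the binary requirement is the discrete constraint. Approximate methods typically relax the binary requirement to X >= 0 and recover a discrete assignment afterwards.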


Associative Embedding: End-to-End Learning for Joint Detection and Grouping

Neural Information Processing Systems

We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. A number of computer vision problems can be framed in this manner including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually the grouping of detections is achieved with multi-stage pipelines; instead, we propose an approach that teaches a network to simultaneously output detections and group assignments. This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions. We show how to apply this method to multi-person pose estimation and report state-of-the-art performance on the MPII and MS-COCO datasets.


Dual-Agent GANs for Photorealistic and Identity Preserving Profile Face Synthesis

Neural Information Processing Systems

Synthesizing realistic profile faces is promising for more efficiently training deep pose-invariant models for large-scale unconstrained face recognition, by populating samples with extreme poses and avoiding tedious annotations. However, learning from synthetic faces may not achieve the desired performance due to the discrepancy between distributions of the synthetic and real face images. To narrow this gap, we propose a Dual-Agent Generative Adversarial Network (DA-GAN) model, which can improve the realism of a face simulator's output using unlabeled real faces, while preserving the identity information during the realism refinement. The dual agents are specifically designed for distinguishing real v.s.


Modulating early visual processing by language

Neural Information Processing Systems

It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic inputs are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by a linguistic input. Specifically, we introduce Conditional Batch Normalization (CBN) as an efficient mechanism to modulate convolutional feature maps by a linguistic embedding. We apply CBN to a pre-trained Residual Network (ResNet), leading to the MODulatEd ResNet (MODERN) architecture, and show that this significantly improves strong baselines on two visual question answering tasks. Our ablation study confirms that modulating from the early stages of the visual processing is beneficial.
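A minimal sketch of the conditioning idea, for a single feature-map channel, might look like this. The abstract does not give the parameterization, so the sketch assumes that offsets to the batch-norm scale and shift are predicted from the embedding by a linear layer; the function name, the weights w_gamma and w_beta, and all values are hypothetical.

```python
import math

def conditional_batch_norm(values, gamma, beta, embedding,
                           w_gamma, w_beta, eps=1e-5):
    """Normalize one channel's activations, then scale and shift them with
    parameters adjusted by a conditioning embedding (assumed linear map)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Offsets predicted from the embedding; a zero embedding leaves the
    # ordinary batch-norm parameters untouched.
    delta_gamma = sum(w * e for w, e in zip(w_gamma, embedding))
    delta_beta = sum(w * e for w, e in zip(w_beta, embedding))
    scale = gamma + delta_gamma
    shift = beta + delta_beta
    return [scale * (v - mean) / math.sqrt(var + eps) + shift for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
W_GAMMA = [0.5, 0.0]   # hypothetical weights of the scale predictor
W_BETA = [2.0, 0.0]    # hypothetical weights of the shift predictor

# Zero embedding: behaves like ordinary batch normalization.
plain = conditional_batch_norm(activations, 1.0, 0.0, [0.0, 0.0], W_GAMMA, W_BETA)
# Non-zero embedding: same activations, different scale (1.5) and shift (2.0).
modulated = conditional_batch_norm(activations, 1.0, 0.0, [1.0, 0.0], W_GAMMA, W_BETA)
```

Predicting offsets rather than the parameters themselves is a design choice of this sketch: it means an uninformative embedding falls back to the pre-trained normalization.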


MaskRNN: Instance Level Video Object Segmentation

Neural Information Processing Systems

Instance level video object segmentation is an important technique for video editing and compression. To capture the temporal coherence, in this paper, we develop MaskRNN, a recurrent neural net approach which fuses in each frame the output of two deep nets for each object instance: a binary segmentation net providing a mask and a localization net providing a bounding box. Due to the recurrent component and the localization component, our method is able to take advantage of long-term temporal structures of the video data as well as to reject outliers. We validate the proposed algorithm on three challenging benchmark datasets, the DAVIS-2016 dataset, the DAVIS-2017 dataset, and the Segtrack v2 dataset, achieving state-of-the-art performance on all of them.


Attentional Pooling for Action Recognition

Neural Information Processing Systems

We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human-object interaction tasks. Our proposed attention module can be trained with or without extra supervision, and gives a sizable boost in accuracy while keeping the network size and computational cost nearly the same. It leads to significant improvements over state-of-the-art base architectures on three standard action recognition benchmarks across still images and videos, and establishes a new state of the art on the MPII dataset with 12.5% relative improvement. We also perform an extensive analysis of our attention module both empirically and analytically. In terms of the latter, we introduce a novel derivation of bottom-up and top-down attention as low-rank approximations of bilinear pooling methods (typically used for fine-grained classification). From this perspective, our attention formulation suggests a novel characterization of action recognition as a fine-grained recognition problem.
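The link between low-rank bilinear pooling and attention can be checked numerically with a small sketch. It assumes a second-order pooling score of the form sum over locations of x^T W x and a rank-1 weight matrix W = a b^T; the function names, vectors, and feature values are hypothetical, and the abstract does not say which factor corresponds to bottom-up and which to top-down attention.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def outer(a, b):
    """Rank-1 matrix a b^T as a list of rows."""
    return [[ai * bj for bj in b] for ai in a]

def bilinear_score(features, weight):
    """Second-order pooling score: sum over locations of x^T W x."""
    return sum(dot(x, [dot(row, x) for row in weight]) for x in features)

def attentional_score(features, a, b):
    """Rank-1 form: a per-location attention value (x . a) weights a
    per-location response (x . b), summed over locations."""
    return sum(dot(x, a) * dot(x, b) for x in features)

# Three spatial locations with 2-D features (illustrative values).
features = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
a = [1.0, 0.0]
b = [2.0, 1.0]

full = bilinear_score(features, outer(a, b))
low_rank = attentional_score(features, a, b)
```

Both scores agree because x^T (a b^T) x = (x . a)(x . b), so a rank-1 bilinear classifier can be read as an attention-weighted sum, at a cost linear rather than quadratic in the feature dimension.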