Goto

Collaborating Authors

 Vision


Its now a federal crime to publish AI deepfake porn

Mashable

The Take It Down Act, a controversial bipartisan bill recently hailed by First Lady Melania Trump as a tool to build a safer internet, is officially law, as President Donald Trump took to the White House Rose Garden today to put ink to legislative paper. It's the first high-profile tech legislation to pass under the new administration. "With the rise of AI image generation, countless women have been harassed with deepfakes and other explicit images distributed against their will. This is wrong, so horribly wrong, and it's a very abusive situation," said Trump at the time of signing. "This will be the first ever federal law to combat the distribution of explicit, imaginary, posted without subject's consent... We've all heard about deepfakes. I have them all the time, but nobody does anything. I ask Pam [Bondi], 'Can you help me Pam?' She says, 'No I'm too busy doing other things. But a lot of people don't survive, that's true and so horrible... Today, we're making it totally illegal."


Deep love or deepfake: Dating in the time of AI

The Japan Times

Beth Hyland thought she had met the love of her life on Tinder. In reality, the Michigan-based administrative assistant had been manipulated by an online scam artist who posed as a French man named "Richard," used deepfake video on Skype calls and posted photos of another man to pull off his con. Deepfakes -- manipulated video or audio made using artificial intelligence to look and sound real -- are often difficult to detect without specialized tools.


A Appendix

Neural Information Processing Systems

A.1 Conventional Test-Time Augmentation Center-Crop is the standard test-time augmentation for most of computer vision tasks [56, 29, 5, 7, 18, 26, 52]. The Center-Crop first resizes an image to a fixed size and then crops the central area to make a predefined input size. We resize an image to 256 pixels and crop the central 224 pixels for ResNet-50 in ImageNet experiment, as the same way as [18, 26, 52]. In the case of CIFAR, all images in the dataset are 32 by 32 pixels; we use the original images without any modification at the test time. Horizontal-Flip is an ensemble method using the original image and the horizontally inverted image.


Learning Loss for Test-Time Augmentation

Neural Information Processing Systems

Data augmentation has been actively studied for robust neural networks. Most of the recent data augmentation methods focus on augmenting datasets during the training phase. At the testing phase, simple transformations are still widely used for test-time augmentation. This paper proposes a novel instance-level testtime augmentation that efficiently selects suitable transformations for a test input. Our proposed method involves an auxiliary module to predict the loss of each possible transformation given the input. Then, the transformations having lower predicted losses are applied to the input.



Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Neural Information Processing Systems

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one example, a query like "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrate significant improvements of SynGen compared with current state of the art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.


Event-3DGS: Event-based 3D Reconstruction Using 3D Gaussian Splatting

Neural Information Processing Systems

Event cameras, offering high temporal resolution and high dynamic range, have brought a new perspective to addressing 3D reconstruction challenges in fast-motion and low-light scenarios. Most methods use the Neural Radiance Field (NeRF) for event-based photorealistic 3D reconstruction. However, these NeRF methods suffer from time-consuming training and inference, as well as limited scene-editing capabilities of implicit representations. To address these problems, we propose Event-3DGS, the first event-based reconstruction using 3D Gaussian splatting (3DGS) for synthesizing novel views freely from event streams. Technically, we first propose an event-based 3DGS framework that directly processes event data and reconstructs 3D scenes by simultaneously optimizing scenario and sensor parameters.


Cal-DETR: Calibrated Detection Transformer

Neural Information Processing Systems

Albeit revealing impressive predictive performance for several computer vision tasks, deep neural networks (DNNs) are prone to making overconfident predictions. This limits the adoption and wider utilization of DNNs in many safety-critical applications. There have been recent efforts toward calibrating DNNs, however, almost all of them focus on the classification task. Surprisingly, very little attention has been devoted to calibrating modern DNN-based object detectors, especially detection transformers, which have recently demonstrated promising detection performance and are influential in many decision-making systems. In this work, we address the problem by proposing a mechanism for calibrated detection transformers (Cal-DETR), particularly for Deformable-DETR, UP-DETR, and DINO.


UP-NeRF: Unconstrained Pose Prior-Free Neural Radiance Field

Neural Information Processing Systems

Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the necessity of pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings such as photometrically consistent and occluder-free image collections or a sequence of images from a video. So they have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose UP-NeRF (Unconstrained Pose-prior-free Neural Radiance Fields) to optimize NeRF with unconstrained image collections without camera pose prior.


Amnesia as a Catalyst for Enhancing Black Box Pixel Attacks in Image Classification and Object Detection

Neural Information Processing Systems

It is well known that query-based attacks tend to have relatively higher successrates in adversarial black-box attacks. While research on black-box attacks is activelybeing conducted, relatively few studies have focused on pixel attacks thattarget only a limited number of pixels. In image classification, query-based pixelattacks often rely on patches, which heavily depend on randomness and neglectthe fact that scattered pixels are more suitable for adversarial attacks. Moreover, tothe best of our knowledge, query-based pixel attacks have not been explored in thefield of object detection. To address these issues, we propose a novel pixel-basedblack-box attack called Remember and Forget Pixel Attack using ReinforcementLearning(RFPAR), consisting of two main components: the Remember and Forgetprocesses. RFPAR mitigates randomness and avoids patch dependency byleveraging rewards generated through a one-step RL algorithm to perturb pixels.RFPAR effectively creates perturbed images that minimize the confidence scoreswhile adhering to limited pixel constraints.