Collaborating Authors

Lippman, Andrew


Secure & Personalized Music-to-Video Generation via CHARCHA

arXiv.org Artificial Intelligence

Music is a deeply personal experience, and our aim is to enhance it with a fully automated pipeline for personalized music video generation. Our work allows listeners to be not just consumers but co-creators in the music video generation process by creating personalized, consistent, and context-driven visuals based on the lyrics, rhythm, and emotion in the music. The pipeline combines multimodal translation and generation techniques and uses low-rank adaptation on listeners' images to create immersive music videos that reflect both the music and the individual. To ensure the ethical use of users' identities, we also introduce CHARCHA, a facial identity verification protocol that protects people against unauthorized use of their face while also collecting authorized images from users for personalizing their videos. This paper thus provides a secure and innovative framework for creating deeply personalized music videos.

Figure 1: Image stills and lyrics from generated music videos for Rick Astley's "Never Gonna Give You Up," with character reference from CHARCHA. The videos use the Queratogray Sketch [1], Western Animation Diffusion [2], and Realistic Vision V5.1 [3] checkpoint models.
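
The pipeline described above hinges on low-rank adaptation (LoRA) of a pretrained generator using a listener's authorized images. Below is a minimal PyTorch sketch of the LoRA mechanism itself; the rank, layer sizes, and where the adapter is attached are illustrative assumptions, not details reported in the paper.

```python
# Minimal LoRA sketch (illustrative only; not the paper's actual pipeline).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Hypothetical usage: wrap one projection of a pretrained generator and fine-tune
# only A and B on the authorized images collected through CHARCHA.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```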


Human Detection of Political Speech Deepfakes across Transcripts, Audio, and Video

arXiv.org Artificial Intelligence

Recent advances in technology for hyper-realistic visual effects provoke the concern that deepfake videos of political speeches will soon be visually indistinguishable from authentic video recordings. The conventional wisdom in communication theory predicts that people will fall for fake news more often when the same version of a story is presented as a video rather than as text. We conduct four pre-registered randomized experiments with 2,015 participants to evaluate how accurately humans distinguish real political speeches from fabrications across base rates of misinformation, audio sources, and media modalities. We find that base rates of misinformation minimally influence discernment and that deepfakes with audio produced by state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice-actor audio. Moreover, we find that audio and visual information enables more accurate discernment than text alone: human discernment relies more on how something is said, the audio-visual cues, than on what is said, the speech content.


Trash to Treasure: Using text-to-image models to inform the design of physical artefacts

arXiv.org Artificial Intelligence

Text-to-image generative models have recently exploded in popularity and accessibility. Yet so far, use of these models in creative tasks that bridge the 2D digital world and the creation of physical artefacts has been understudied. We conduct a pilot study to investigate if and how text-to-image models can be used to assist in upstream tasks within the creative process, such as ideation and visualization, prior to a sculpture-making activity. Thirty participants selected sculpture-making materials and generated three images using the Stable Diffusion text-to-image generator, each with text prompts of their choice, with the aim of informing and then creating a physical sculpture. The majority of participants (23/30) reported that the generated images informed their sculptures, and 28/30 reported interest in using text-to-image models to help them in a creative task in the future. We identify several prompt engineering strategies and find that a participant's prompting strategy relates to their stage in the creative process. We discuss how our findings can inform support for users at different stages of the design process and for using text-to-image models for physical artefact design.


The Glass Infrastructure: Using Common Sense to Create a Dynamic, Place-Based Social Information System

AI Magazine

Then we add some world knowledge, in the form of commonsense statements, to help in the text understanding. The result combines this knowledge to form a multidimensional space where concepts, people, groups, and projects are all represented as vectors. From that space we retrieve information relevant to lab visitors, dynamically creating their presence in the vector space by building a vector from the projects they have chosen as favorites. We then use the vector space to determine the relevance of objects in the space to each other: which projects are similar, which projects would be good fits for a lab visitor, and which projects fit which lab themes. Additionally, we have designed a user interface that makes this system easy and social to interact with. The following subsections discuss our approach to interface design, our methods for extracting semantic information from the text base, and our methods for assessing similarity of user interests with that knowledge.
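
As a rough illustration of the vector-space matching described above, the sketch below ranks projects by cosine similarity to a visitor vector built from their chosen favorites. The embeddings, project names, and scoring function are hypothetical placeholders, not the system's actual representation.

```python
# Illustrative vector-space relevance ranking; all vectors and names are made up.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical concept-space embeddings of lab projects.
projects = {
    "Glass Infrastructure": np.array([0.9, 0.1, 0.3]),
    "Music-to-Video":       np.array([0.2, 0.8, 0.5]),
    "Shot Segmentation":    np.array([0.1, 0.7, 0.9]),
}

# A visitor's vector is built from the projects they marked as favorites.
favorites = ["Music-to-Video"]
visitor = np.mean([projects[name] for name in favorites], axis=0)

# Rank the remaining projects by relevance to the visitor.
ranked = sorted(
    ((name, cosine(visitor, vec)) for name, vec in projects.items() if name not in favorites),
    key=lambda kv: kv[1],
    reverse=True,
)
print(ranked)
```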


Bayesian Video Shot Segmentation

Neural Information Processing Systems

Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.
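
To make the idea of an adaptive, duration-aware threshold concrete, here is a small sketch of a likelihood-ratio test whose effective threshold changes with the time elapsed since the last detected boundary, using an assumed Weibull shot-duration prior and Gaussian activity likelihoods. The distributions and parameters are placeholders, not the paper's fitted models.

```python
# Sketch of an adaptive, duration-aware shot-boundary test (illustrative assumptions).
import numpy as np

def log_hazard_weibull(t, k=2.0, lam=120.0):
    """Log hazard rate of an assumed Weibull shot-duration prior at elapsed time t (frames)."""
    return np.log(k / lam) + (k - 1) * np.log(max(t, 1e-6) / lam)

def is_boundary(activity, frames_since_cut,
                mu_cut=0.6, mu_no_cut=0.1, sigma=0.15):
    """Declare a cut when the activity likelihood ratio plus the duration prior favors 'boundary'."""
    ll_cut    = -0.5 * ((activity - mu_cut) / sigma) ** 2
    ll_no_cut = -0.5 * ((activity - mu_no_cut) / sigma) ** 2
    # Adaptive threshold: shortly after a cut, a new boundary is a priori unlikely.
    return ll_cut - ll_no_cut + log_hazard_weibull(frames_since_cut) > 0.0

# Example: the same activity value is rejected right after a cut but accepted much later.
print(is_boundary(0.5, frames_since_cut=5))    # False
print(is_boundary(0.5, frames_since_cut=300))  # True
```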


Learning from User Feedback in Image Retrieval Systems

Neural Information Processing Systems

We formulate the problem of retrieving images from visual databases as a problem of Bayesian inference. This leads to natural and effective solutions for two of the most challenging issues in the design of a retrieval system: providing support for region-based queries without requiring prior image segmentation, and accounting for user-feedback during a retrieval session. We present a new learning algorithm that relies on belief propagation to account for both positive and negative examples of the user's interests.
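
A simplified sketch of Bayesian relevance feedback in this spirit is shown below: each database image carries a log-belief that positive examples raise and negative examples lower. The Gaussian feature model and the weighting of negatives are assumptions for illustration, not the paper's exact belief-propagation scheme.

```python
# Simplified Bayesian relevance-feedback sketch; feature model and negative
# weighting are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 8))     # feature vector per database image
log_belief = np.zeros(len(database))     # uniform prior over images

def log_likelihood(example, sigma=1.0):
    """log p(example | image) under an isotropic Gaussian around each image's features."""
    d2 = ((database - example) ** 2).sum(axis=1)
    return -0.5 * d2 / sigma**2

def update(example, positive=True, neg_weight=0.5):
    """Accumulate evidence from one round of user feedback."""
    global log_belief
    ll = log_likelihood(example)
    log_belief += ll if positive else -neg_weight * ll
    log_belief -= log_belief.max()       # keep the posterior well scaled

# One feedback round: the user likes image 3 and dislikes image 7.
update(database[3], positive=True)
update(database[7], positive=False)
print(np.argsort(log_belief)[::-1][:5])  # current top-5 candidates
```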


Learning Mixture Hierarchies

Neural Information Processing Systems

The hierarchical representation of data has various applications in domains such as data mining, machine vision, or information retrieval. In this paper we introduce an extension of the Expectation-Maximization (EM) algorithm that learns mixture hierarchies in a computationally efficient manner. Efficiency is achieved by progressing in a bottom-up fashion, i.e. by clustering the mixture components of a given level in the hierarchy to obtain those of the level above. This clustering requires only knowledge of the mixture parameters, there being no need to resort to intermediate samples. In addition to practical applications, the algorithm allows a new interpretation of EM that makes clear the relationship with nonparametric kernel-based estimation methods, provides explicit control over the tradeoff between the bias and variance of EM estimates, and offers new insights about the behavior of deterministic annealing methods commonly used with EM to escape local minima of the likelihood.
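
The bottom-up step can be sketched as follows for diagonal-Gaussian mixtures: each child component acts as a block of virtual samples sized in proportion to its weight, and the parent mixture is re-estimated from the children's parameters alone. The virtual sample count, initialization, and diagonal-covariance restriction are simplifying assumptions relative to the paper's formulation.

```python
# Sketch of bottom-up EM over mixture components (diagonal-Gaussian simplification).
# Only the child parameters (weights, means, variances) are used, never the raw data.
import numpy as np

def coarsen(w, mu, var, n_parents, N=1000, n_iter=25, seed=0):
    """Cluster a C-component diagonal Gaussian mixture (w:(C,), mu:(C,D), var:(C,D))
    into an n_parents-component mixture; returns (weights, means, variances)."""
    C, D = mu.shape
    rng = np.random.default_rng(seed)
    pj = np.full(n_parents, 1.0 / n_parents)                 # parent weights
    mj = mu[rng.choice(C, n_parents, replace=False)].copy()  # parent means
    vj = np.ones((n_parents, D))                             # parent variances
    Ni = w * N                                               # virtual samples per child

    for _ in range(n_iter):
        # E-step: responsibility h[i, j] of parent j for child component i,
        # treating child i as Ni[i] virtual samples summarized by (mu[i], var[i]).
        log_h = np.empty((C, n_parents))
        for j in range(n_parents):
            per_sample = -0.5 * np.sum(
                np.log(2 * np.pi * vj[j]) + (var + (mu - mj[j]) ** 2) / vj[j], axis=1)
            log_h[:, j] = np.log(pj[j]) + Ni * per_sample
        log_h -= log_h.max(axis=1, keepdims=True)
        h = np.exp(log_h)
        h /= h.sum(axis=1, keepdims=True)

        # M-step: refit parents from weighted child sufficient statistics.
        mass = (h * Ni[:, None]).sum(axis=0)
        pj = mass / mass.sum()
        mj = (h * Ni[:, None]).T @ mu / mass[:, None]
        for j in range(n_parents):
            vj[j] = ((h[:, j] * Ni) @ (var + (mu - mj[j]) ** 2)) / mass[j]
    return pj, mj, vj
```

In a deeper hierarchy, the coarser mixture returned by this step would itself be clustered again to form the next level up.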