ObjectMover: Generative Object Movement with Video Prior
Yu, Xin; Wang, Tianyu; Kim, Soo Ye; Guerrero, Paul; Chen, Xi; Liu, Qing; Lin, Zhe; Qi, Xiaojuan
Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is that we model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.
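The editing task the abstract describes can be made concrete with a toy sketch. The function below is not ObjectMover's method: it merely cuts the masked object out, pastes it at an offset, and fills the vacated hole with the image mean, which is exactly the naive baseline whose shortcomings (no inpainting, no lighting harmonization, no shadow movement) motivate a generative model. The two-image stack at the end mirrors the paper's sequence-to-sequence framing, in which the edited image is treated as the next "frame" conditioned on the source.

```python
import numpy as np

def naive_object_move(img, mask, dy, dx):
    """Toy illustration of the object-movement task: translate the
    masked pixels by (dy, dx) and fill the vacated region with the
    image mean. A real model must inpaint the hole, re-harmonize
    lighting, and move shadows/reflections as well."""
    out = img.astype(float).copy()
    ys, xs = np.nonzero(mask)
    out[ys, xs] = img.mean()                    # vacated region (needs inpainting)
    ty, tx = ys + dy, xs + dx                   # translated object coordinates
    keep = (ty >= 0) & (ty < img.shape[0]) & (tx >= 0) & (tx < img.shape[1])
    out[ty[keep], tx[keep]] = img[ys[keep], xs[keep]]  # pasted object
    return out

img = np.zeros((8, 8))
img[2:4, 2:4] = 1.0                             # bright 2x2 "object"
mask = img > 0
edited = naive_object_move(img, mask, 3, 3)
# (source, edited) as a 2-frame sequence, echoing the paper's
# sequence-to-sequence formulation of the edit.
sequence = np.stack([img, edited])              # shape (2, 8, 8)
```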
- Media > Photography (0.49)
- Leisure & Entertainment > Games > Computer Games (0.34)
- Information Technology > Software (0.34)
GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats
Deng, Kai; Yang, Jian; Wang, Shenlong; Xie, Jin
Tracking and mapping in large-scale, unbounded outdoor environments using only monocular RGB input presents substantial challenges for existing SLAM systems. Traditional Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) SLAM methods are typically limited to small, bounded indoor settings. To overcome these challenges, we introduce GigaSLAM, the first NeRF/3DGS-based SLAM framework for kilometer-scale outdoor environments, as demonstrated on the KITTI and KITTI 360 datasets. Our approach employs a hierarchical sparse voxel map representation, where Gaussians are decoded by neural networks at multiple levels of detail. This design enables efficient, scalable mapping and high-fidelity viewpoint rendering across expansive, unbounded scenes. For front-end tracking, GigaSLAM utilizes a metric depth model combined with epipolar geometry and PnP algorithms to accurately estimate poses, while incorporating a Bag-of-Words-based loop closure mechanism to maintain robust alignment over long trajectories. Consequently, GigaSLAM delivers high-precision tracking and visually faithful rendering on urban outdoor benchmarks, establishing a robust SLAM solution for large-scale, long-term scenarios, and significantly extending the applicability of Gaussian Splatting SLAM systems to unbounded outdoor environments.
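The hierarchical sparse voxel idea in the abstract can be sketched in a few lines. This is an assumption-based toy, not GigaSLAM's implementation (which decodes Gaussians with neural networks): each level halves the cell size, and points are stored only in cells they actually touch, so empty space in an unbounded scene costs nothing and coarse levels can serve distant viewpoints.

```python
import math
from collections import defaultdict

def voxel_key(p, cell):
    """Integer grid index of 3D point p at the given cell size."""
    return tuple(math.floor(c / cell) for c in p)

class HierarchicalVoxelMap:
    """Minimal sketch of a hierarchical sparse voxel map: one sparse
    hash grid per level of detail, finest level = smallest cells."""
    def __init__(self, base_cell=8.0, levels=3):
        self.cells = [base_cell / (2 ** i) for i in range(levels)]
        self.grids = [defaultdict(list) for _ in range(levels)]

    def insert(self, p):
        for grid, cell in zip(self.grids, self.cells):
            grid[voxel_key(p, cell)].append(p)

    def query(self, p, level):
        """Points sharing p's voxel at the requested level of detail."""
        return self.grids[level][voxel_key(p, self.cells[level])]

m = HierarchicalVoxelMap()
m.insert((1.0, 2.0, 3.0))
m.insert((5.0, 2.0, 3.0))      # same coarse cell, different fine cell
m.insert((100.0, 0.0, 0.0))    # far-away point: never touches nearby cells
```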
- Asia > China (0.28)
- North America > United States > Illinois (0.14)
- Europe > Switzerland (0.14)
The Digital Insider
Adobe takes home the award thanks to its new, exciting update to Premiere Pro: text-based editing. At NAB, Adobe showed us why Premiere Pro is the go-to editing software for so many editors. While text-based editing was the highlight for us, Adobe also unveiled an impressive range of new features across its Creative Cloud video programs. Adobe showcased new features in Premiere Pro that will be shipping in May. These included text-based editing along with an AI-based workflow powered by Adobe Sensei.
Graph data science: What you need to know
We are excited to bring Transform 2022 back in-person July 19 and virtually July 20 - 28. Join AI and data leaders for insightful talks and exciting networking opportunities. Whether you're genuinely interested in getting insights and solving problems using data, or just attracted by what has been called "the most promising career" by LinkedIn and the "best job in America" by Glassdoor, chances are you're familiar with data science. As we've elaborated previously, graphs are a universal data structure with manifestations that span a wide spectrum: from analytics to databases, and from knowledge management to data science, machine learning and even hardware. Graph data science is when you want to answer questions, not just with your data, but with the connections between your data points -- that's the 30-second explanation, according to Alicia Frame. Frame is the senior director of product management for data science at Neo4j, a leading graph database vendor.
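Frame's "questions answered with the connections between your data points" can be illustrated without any graph database. The sketch below (plain-Python breadth-first search over an adjacency list; the node names are invented for illustration) answers the connection-centric question "how is A linked to B?" that a row-oriented query cannot express naturally.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency-list graph.
    Returns the first shortest node sequence found, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical "who influences whom" toy graph.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
```

In a graph database such as Neo4j the same question would be a one-line path query; the point here is only that the answer lives in the edges, not in any single record.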
RheFrameDetect: A Text Classification System for Automatic Detection of Rhetorical Frames in AI from Open Sources
Ghosh, Saurav; Loustaunau, Philippe
Rhetorical Frames in AI can be thought of as expressions that describe AI development as a competition between two or more actors, such as governments or companies. Examples of such Frames include robotic arms race, AI rivalry, technological supremacy, cyberwarfare dominance and 5G race. Detection of Rhetorical Frames from open sources can help us track the attitudes of governments or companies towards AI, specifically whether attitudes are becoming more cooperative or competitive over time. Given the rapidly increasing volumes of open sources (online news media, Twitter, blogs), it is difficult for subject matter experts to identify Rhetorical Frames in (near) real-time. Moreover, these sources are generally unstructured (noisy); detecting Frames from them therefore requires state-of-the-art text classification techniques. In this paper, we develop RheFrameDetect, a text classification system for (near) real-time capture of Rhetorical Frames from open sources. Given an input document, RheFrameDetect employs text classification techniques at multiple levels (document level and paragraph level) to identify all occurrences of Frames used in the discussion of AI. We performed an extensive evaluation of the text classification techniques used in RheFrameDetect against human-annotated Frames from multiple news sources. To further demonstrate the effectiveness of RheFrameDetect, we present multiple case studies comparing the Frames identified by RheFrameDetect against human-annotated Frames.
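The two-level (document and paragraph) detection scheme the abstract describes can be sketched with a deliberately simple stand-in. RheFrameDetect uses trained text classifiers; the keyword matcher below is only an illustration of the multi-level output shape, and the frame names and cue phrases are invented for the example.

```python
# Illustrative cue phrases, not the paper's actual features.
FRAME_CUES = {
    "arms race": ["arms race", "5g race"],
    "technological supremacy": ["supremacy", "dominance"],
}

def detect_frames(text):
    """Return the sorted list of frames whose cues appear in text."""
    low = text.lower()
    return sorted(f for f, cues in FRAME_CUES.items()
                  if any(c in low for c in cues))

def detect_document(paragraphs):
    """Document-level labels plus per-paragraph occurrences, mirroring
    the system's multi-level classification output."""
    doc_frames = detect_frames(" ".join(paragraphs))
    per_par = [detect_frames(p) for p in paragraphs]
    return doc_frames, per_par

doc = ["The AI arms race between the two powers accelerated.",
       "Officials framed 5G as a contest for technological supremacy."]
doc_frames, per_par = detect_document(doc)
```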
- Asia > China (0.28)
- Asia > Russia (0.14)
- North America > United States > California (0.04)
- Media > News (0.68)
- Government > Regional Government > Asia Government (0.46)
A recent article by Ronald Brachman (Brachman, 1985) points out some philosophical or semantic problems in using the notion of a prototype, which is described by using default properties. The problem arises since default properties can be overridden or cancelled in representing particular instances, and therefore lack definitional power: i.e., they are not really essential to the concept being represented. As an example, Brachman presents an elephant joke: Q: What's big and gray, has a trunk, and lives in the trees? A: An elephant -- I lied about the trees. Before discussing a solution to this dilemma, consider the following modified version of the elephant joke, perhaps not quite as funny: Q: What's big and gray, has a trunk, and lives in the trees?
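The representational move Brachman criticizes, defaults that any instance may cancel, is easy to make concrete. The toy frame class below (an illustration, not any particular frame system) shows that because `habitat` can simply be overridden, it carries no definitional weight for the concept "elephant", which is exactly the point of the joke.

```python
class Frame:
    """Toy prototype frame with cancellable default properties."""
    def __init__(self, prototype=None, **props):
        self.prototype = prototype
        self.props = props

    def get(self, name):
        if name in self.props:
            return self.props[name]          # local (possibly overriding) value
        if self.prototype is not None:
            return self.prototype.get(name)  # inherit the prototype's default
        raise KeyError(name)

elephant = Frame(color="gray", has_trunk=True, habitat="ground")
# "I lied about the trees": the habitat default is simply cancelled.
tree_elephant = Frame(prototype=elephant, habitat="trees")
```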
Department of Computer Science, Columbia University, New York, NY 10027
Abstract: This article surveys a portion of the field of natural language processing. The main areas considered are those dealing with representation schemes, particularly work on physical object representation, and generalization processes driven by natural language understanding. The emphasis of this article is on conceptual representation of objects based on the semantic interpretation of natural language input. Six programs serve as case studies for guiding the course of the article. Within the framework of describing each of these programs, several other programs, ideas, and theories that are relevant to the program in focus are presented. RECENT ADVANCES in natural language processing [NLP] have generated considerable interest within the Artificial Intelligence [AI] and Cognitive Science communities.
Various Views on Spatial Prepositions
In this article, principles involving the intrinsic, deictic, and extrinsic use of spatial prepositions are examined from linguistic, psychological, and AI approaches. First, I define some important terms. Second, those prepositions which permit intrinsic, deictic, and extrinsic use are specified. Third, I examine how the frame of reference is determined for all three cases. Fourth, I look at ambiguities in the use of prepositions and how they can be resolved.
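The intrinsic/deictic distinction the article examines can be sketched geometrically. In the 2D toy below (the vectors and the cross-product test are illustrative choices, not the article's formalism), "X is left of Y" is evaluated either against Y's own facing direction (intrinsic) or against the speaker's line of sight toward Y (deictic); the same scene can come out True under one reading and False under the other, which is the ambiguity the article addresses.

```python
def cross(a, b):
    """2D cross product; positive when b is counterclockwise of a."""
    return a[0] * b[1] - a[1] * b[0]

def left_of_intrinsic(x, y, y_front):
    """Is x to the left of y, relative to y's own facing direction?"""
    rel = (x[0] - y[0], x[1] - y[1])
    return cross(y_front, rel) > 0

def left_of_deictic(x, y, speaker):
    """Is x to the left of y, as seen from the speaker's position?"""
    view = (y[0] - speaker[0], y[1] - speaker[1])  # speaker's line of sight
    rel = (x[0] - y[0], x[1] - y[1])
    return cross(view, rel) > 0

# y faces north; the speaker stands north of y, looking back south.
# The point west of y is "left" intrinsically but not deictically.
intrinsic = left_of_intrinsic((-1, 0), (0, 0), (0, 1))
deictic = left_of_deictic((-1, 0), (0, 0), (0, 5))
```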
Steps toward a Cognitive Vision System
An adequate natural language description of developments in a real-world scene can be taken as proof of "understanding what is going on." An algorithmic system that generates natural language descriptions from video recordings of road traffic scenes can be said to "understand" its input to the extent that algorithmically generated text is acceptable to the humans judging it. The ability to present a "variant formulation" without distorting the essential parts of the original message is taken as a cue that these essentials have been "understood." During art lessons, in particular those concerned with classical or ecclesiastic paintings, students are initially invited to merely describe what they see. Frequently, considerable a priori knowledge about ancient mythology or biblical traditions is required to succinctly characterize the depicted scene. Lack of the corresponding knowledge about other cultures can make it difficult for someone with only a European education to really understand and describe in an appropriate manner a painting by, for example, a Far East classic artist. Familiar human experiences mentioned in the preceding paragraph will now be "morphed" into a scientific challenge: to design and implement an algorithmic engine that generates an appropriate textual description of essential developments in a video sequence recorded from a real-world scene. Such an algorithmic engine will serve as one example of a cognitive vision system (CVS), which leaves room, as the experienced reader has noticed, for there to be more than one way to introduce the concept of a CVS. An alternative clearly consists in coupling a computer vision system with a robotic system of some kind and assessing the reactions of such a compound system. Whoever accepts the formulation "one of the actions available to an agent is to produce language. This is called a speech act" (Russell and Norvig, 1995) is unlikely to consider the two variants of a CVS alluded to previously as being fundamentally different. With regard to the first CVS version in particular, the following remarks are submitted for consideration: Obviously, we avoid a precise definition of understanding in favor of having humans compare the reaction of an algorithmic engine to that expected from a human. This fuzzy approach toward the circumscription of a CVS opens the road to constructive criticism--that is, to incremental system improvement--by pinpointing aspects of an output text that are not yet considered satisfactory.
What Is a Knowledge Representation?
Although knowledge representation is one of the central and, in some ways, most familiar concepts in AI, the most fundamental question about it--What is it?--has rarely been answered directly. Numerous papers have lobbied for one or another variety of representation, other papers have argued for various properties a representation should have, and still others have focused on properties that are important to the notion of representation in general. In this article, we go back to basics to address the question directly. We believe that the answer can best be understood in terms of five important and distinctly different roles that a representation plays, each of which places different and, at times, conflicting demands on the properties a representation should have.