Towards Understanding Camera Motions in Any Video
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of 3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our core contributions is a taxonomy or "language" of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some primitives like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
Strategyproof Facility Location for Five Agents on a Circle using PCD
We consider the strategyproof facility location problem on a circle. We focus on the case of 5 agents, and find a tight bound for the PCD strategyproof mechanism, which selects the reported location of an agent in proportion to the length of the arc in front of it. We methodically "reduce" the size of the instance space and then use standard optimization techniques to find the bound and prove that it is tight. Moreover, we hypothesize the approximation ratio of PCD for general odd $n$.
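To make the selection rule concrete, here is a minimal Python sketch of a PCD-style mechanism. It rests on assumptions not stated in the abstract: the circle has circumference 1, reports are numbers in [0, 1), and "the arc in front of" agent i is read as the arc between the two agents sitting opposite i in cyclic order (for odd n, the agents (n-1)/2 and (n+1)/2 steps away). The function name pcd_probabilities is hypothetical, and the paper may define the arc differently.

```python
from typing import List


def pcd_probabilities(reports: List[float]) -> List[float]:
    """Selection probability of each agent's reported location.

    Assumptions (ours, not the paper's): circumference 1, reports in [0, 1),
    and the arc "in front of" an agent is the gap between the two agents
    opposite it in cyclic order. Requires an odd number of agents.
    """
    n = len(reports)
    if n < 3 or n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of agents, n >= 3")

    # Sort agents around the circle, remembering their original indices.
    order = sorted(range(n), key=lambda i: reports[i])
    pos = [reports[i] for i in order]

    # gaps[j] is the arc length from pos[j] to the next agent in cyclic order.
    gaps = [pos[j + 1] - pos[j] for j in range(n - 1)]
    gaps.append(1.0 - (pos[-1] - pos[0]))  # wrap-around arc; gaps sum to 1

    # For odd n, agent j faces the gap that starts (n - 1) / 2 steps ahead.
    k = (n - 1) // 2
    probs = [0.0] * n
    for j, agent in enumerate(order):
        probs[agent] = gaps[(j + k) % n]
    return probs
```

Because the gaps partition the circle and each agent faces a distinct gap when n is odd, the returned values already sum to 1 and need no normalization. For example, pcd_probabilities([0.0, 0.1, 0.3, 0.6, 0.8]) assigns the agent at 0.0 the arc between 0.3 and 0.6.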
Towards Understanding Camera Motions in Any Video
Lin, Zhiqiu, Cen, Siyuan, Jiang, Daniel, Karhade, Jay, Wang, Hewei, Mitra, Chancharik, Ling, Tiffany, Huang, Yuhan, Liu, Sifan, Chen, Mingyu, Zawar, Rushikesh, Bai, Xue, Du, Yilun, Gan, Chuang, Ramanan, Deva
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
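The zoom-in versus forward-translation confusion mentioned above can be illustrated with a pinhole camera model. The sketch below is our own illustration, not code from CameraBench; the function names and the sample points are hypothetical. It shows that changing the focal length magnifies every point by the same factor, whereas moving the camera forward magnifies near points more than far ones.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def project(point: Point3D, focal_length: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point (x, y, z), with z > 0."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def magnification(point: Point3D, f_before: float, f_after: float,
                  forward: float = 0.0) -> float:
    """Ratio of the image x-coordinate after vs. before the camera change.

    f_before -> f_after models a zoom (intrinsics change);
    forward models the camera translating along its optical axis
    (extrinsics change), which reduces the point's depth.
    """
    x, y, z = point
    u_before, _ = project((x, y, z), f_before)
    u_after, _ = project((x, y, z - forward), f_after)
    return u_after / u_before


near = (1.0, 0.0, 2.0)   # hypothetical point 2 units from the camera
far = (1.0, 0.0, 10.0)   # hypothetical point 10 units from the camera

# Zoom: focal length doubles, camera stays put.
zoom_near = magnification(near, 1.0, 2.0)
zoom_far = magnification(far, 1.0, 2.0)

# Translate forward by 1 unit, focal length unchanged.
dolly_near = magnification(near, 1.0, 1.0, forward=1.0)
dolly_far = magnification(far, 1.0, 1.0, forward=1.0)
```

Here zoom_near and zoom_far are equal (uniform scaling), while dolly_near exceeds dolly_far. That depth-dependent parallax is the cue that separates the two motions.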
Impact-resistant, autonomous robots inspired by tensegrity architecture
Johnson, William R. III, Huang, Xiaonan, Lu, Shiyang, Wang, Kun, Booth, Joran W., Bekris, Kostas, Kramer-Bottiglio, Rebecca
Future robots will navigate perilous, remote environments with resilience and autonomy. Researchers have proposed building robots with compliant bodies to enhance robustness, but this approach often sacrifices the autonomous capabilities expected of rigid robots. Inspired by tensegrity architecture, we introduce a tensegrity robot -- a hybrid robot made from rigid struts and elastic tendons -- that demonstrates the advantages of compliance and the autonomy necessary for task performance. This robot boasts impact resistance and autonomy in a field environment and additional advances in the state of the art, including surviving harsh impacts from drops (at least 5.7 m), accurately reconstructing its shape and orientation using on-board sensors, achieving high locomotion speeds (18 bar lengths per minute), and climbing the steepest incline of any tensegrity robot (28 degrees). We characterize the robot's locomotion on unstructured terrain, showcase its autonomous capabilities in navigation tasks, and demonstrate its robustness by rolling it off a cliff.
Autoregressive Large Language Models are Computationally Universal
Schuurmans, Dale, Dai, Hanjun, Zanini, Francesco
We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of the model's weights. Establishing this result requires understanding how a language model can process arbitrarily long inputs using a bounded context. For this purpose, we consider a generalization of autoregressive decoding where, given a long input, emitted tokens are appended to the end of the sequence as the context window advances. We first show that the resulting system corresponds to a classical model of computation, a Lag system, that has long been known to be computationally universal. By leveraging a new proof, we show that a universal Turing machine can be simulated by a Lag system with 2027 production rules. We then investigate whether an existing large language model can simulate the behaviour of such a universal Lag system. We give an affirmative answer by showing that a single system-prompt can be developed for gemini-1.5-pro-001 that drives the model, under deterministic (greedy) decoding, to correctly apply each of the 2027 production rules. We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer.
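For readers unfamiliar with Lag systems, the following Python sketch shows a single update step. It assumes the common formulation in which the rule is chosen by the first two symbols of the sequence, its production is appended to the end, and the first symbol is then dropped. The rule table TOY_RULES is a hypothetical toy over two symbols; it is not the 2027-rule universal system described in the abstract.

```python
from typing import Dict, List, Tuple

Rules = Dict[Tuple[str, str], str]

# Hypothetical toy rule table: (first symbol, second symbol) -> string to append.
TOY_RULES: Rules = {
    ("a", "a"): "b",
    ("a", "b"): "",
    ("b", "a"): "a",
    ("b", "b"): "",
}


def lag_step(sequence: str, rules: Rules) -> str:
    """Apply one step: match the two leading symbols, append, drop the first."""
    window = (sequence[0], sequence[1])
    return sequence[1:] + rules[window]


def run(sequence: str, rules: Rules, max_steps: int = 100) -> List[str]:
    """Return the list of configurations, starting with the input.

    Stops when fewer than two symbols remain, when no rule matches,
    or after max_steps steps.
    """
    trace = [sequence]
    for _ in range(max_steps):
        if len(sequence) < 2 or (sequence[0], sequence[1]) not in rules:
            break
        sequence = lag_step(sequence, rules)
        trace.append(sequence)
    return trace


trace = run("aab", TOY_RULES)
```

The analogy to decoding is that the two-symbol window plays the role of the bounded context, and the appended production plays the role of the emitted tokens.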
Probing Mechanical Reasoning in Large Vision Language Models
Sun, Haoran, Gao, Qingying, Lyu, Haiyun, Luo, Dezhi, Deng, Hokin, Li, Yijiang
Mechanical reasoning is a fundamental ability that sets human intelligence apart from other animal intelligence. Mechanical reasoning allows us to design tools, build bridges and canals, and construct houses, which set the foundation of human civilization. Endowing machines with such an ability is an important step towards building human-level artificial intelligence. Recently, Li et al. (2024) built CogDevelop2K, a data-intensive cognitive experiment benchmark for assaying the developmental trajectory of machine intelligence. Here, to investigate mechanical reasoning in Vision Language Models, we leverage the MechBench of CogDevelop2K, which contains approximately 150 cognitive experiments, to test understanding of mechanical system stability, gears and pulley systems, seesaw-like systems and leverage principle, inertia and motion, and fluid-related systems in Large Vision Language Models. We observe diverse yet consistent behaviors over these aspects in VLMs.
A Motion Planning Algorithm in a Figure Eight Track
Jardon, Cristian, Sheppard, Brian, Zaveri, Veet
We design a motion planning algorithm to coordinate the movements of two robots along a figure eight track, in such a way that no collisions occur. We use a topological approach to robot motion planning that relates instabilities in motion planning algorithms to topological features of configuration spaces. The topological complexity of a configuration space is an invariant that measures the complexity of motion planning algorithms. We show that the topological complexity of our problem is 3 and construct an explicit algorithm with three continuous instructions.
Universal Syntactic Structures: Modeling Syntax for Various Natural Languages
Kim, Min K., Takero, Hafu, Fedovik, Sara
We aim to provide an explanation for how the human brain might connect words for sentence formation. A novel approach to modeling syntactic representation is introduced, potentially showing the existence of universal syntactic structures for all natural languages. As the discovery of DNA's double helix structure shed light on the inner workings of genetics, we wish to introduce a basic understanding of how language might work in the human brain. It could be the brain's way of encoding and decoding knowledge. It also brings some insight into theories in linguistics, psychology, and cognitive science. After looking into the logic behind universal syntactic structures and the methodology of the modeling technique, we attempt to analyze corpora that showcase universality in the language process of different natural languages such as English and Korean. Lastly, we discuss the critical period hypothesis, universal grammar, and a few other assertions on language for the purpose of advancing our understanding of the human brain.
Classification of Orbits in Poincaré Maps using Machine Learning
The quest for low-cost fusion power has led to the construction of experimental devices such as the DIII-D [8], an operational device for conducting magnetic fusion research, and ITER [16], an international project to help make the transition from studies of plasma physics to electricity-generating fusion power plants. These devices, called tokamaks, use magnetic fields to confine the fusion fuel in the form of a plasma, enabling physicists to perform experiments to determine the best shape for the hot reacting plasma and the magnetic fields necessary to hold it in place. To complement the experiments, computer simulations are used to gain an understanding of the complex physics of the plasmas, design new reactors, and select the parameters to be used in experiments. Data from both the experiments and the simulations are analyzed to provide the insights that will contribute to achieving the goal of fusion power. In this paper, we focus on a specific analysis problem that arises in both simulation and experimental data, namely, the classification of orbits in a Poincaré map, also called a Poincaré plot. These two-dimensional plots are obtained for planes, called poloidal planes, which intersect the torus-shaped tokamak perpendicular to the magnetic axis, as shown in Figure 1(a). A plot consists of several orbits, each composed of a number of points (Figure 1(b)). For a given orbit, these points are the intersections of a field line (the solid lines in Figure 1(a)) with a poloidal plane, as the field line is followed around the torus. There are four distinct shapes traced out by these points, leading to four classes of orbits: quasi-periodic, separatrix, island chain, and stochastic, as shown in Figure 2. In some cases, the orbit shows its distinctive shape with just a few points, corresponding to the first few intersections of the field line with the poloidal plane.