Appendix: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Neural Information Processing Systems

For VaTeX captioning and retrieval, we use the latest v1.1 version, which contains 25,991 videos for training and 6,000 videos for public testing. The statistics can be found in Table 1. Visual Genome synsets are pairs, where the keys are noisy natural language phrases and the values are the mapped WordNet synsets [6]. If a visual token occurs in multiple frames, we use the averaged frame index as its temporal indicator. Specifically, for UniVL, we set the number of epochs to 50 and the linear warmup steps to 40.
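The averaged-frame-index rule described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper; the function name is invented.

```python
def temporal_indicator(frame_indices):
    """Temporal indicator for a visual token: the mean of the frame
    indices in which the token occurs (averaged frame index)."""
    return sum(frame_indices) / len(frame_indices)

# A token seen in frames 2, 4, and 9 gets the indicator 5.0.
indicator = temporal_indicator([2, 4, 9])
```

A token appearing in only one frame reduces to that frame's index, so the rule handles both cases uniformly.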


How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control

Li, Kunhang, Naradowsky, Jason, Feng, Yansong, Miyao, Yusuke

arXiv.org Artificial Intelligence

We explore the human motion knowledge of Large Language Models (LLMs) through 3D avatar control. Given a motion instruction, we prompt LLMs to first generate a high-level movement plan with consecutive steps (High-level Planning), then specify body part positions in each step (Low-level Planning), which we linearly interpolate into avatar animations. Using 20 representative motion instructions that cover fundamental movements and balance body part usage, we conduct comprehensive evaluations, including human and automatic scoring of both high-level movement plans and generated animations, as well as automatic comparison with oracle positions in low-level planning. Our findings show that LLMs are strong at interpreting high-level body movements but struggle with precise body part positioning. While decomposing motion queries into atomic components improves planning, LLMs face challenges in multi-step movements involving high-degree-of-freedom body parts. Furthermore, LLMs provide reasonable approximations for general spatial descriptions, but fall short in handling precise spatial specifications. Notably, LLMs demonstrate promise in conceptualizing creative motions and distinguishing culturally specific motion patterns.
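The abstract states that the low-level body part positions are linearly interpolated into avatar animations. A minimal sketch of that interpolation step follows; the body-part names, coordinate convention, and function signature are assumptions for illustration, not details from the paper.

```python
def interpolate_keyframes(start, end, num_frames):
    """Linearly interpolate each body-part coordinate between two
    consecutive plan steps, producing num_frames animation frames
    (endpoints included)."""
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append({
            part: tuple(s + alpha * (e - s) for s, e in zip(start[part], end[part]))
            for part in start
        })
    return frames

# Example: raise the right hand from hip height to overhead in 5 frames.
frames = interpolate_keyframes(
    {"right_hand": (0.3, 0.9, 0.0)},
    {"right_hand": (0.3, 1.8, 0.0)},
    num_frames=5,
)
```

Chaining this over every pair of consecutive steps in the low-level plan yields the full animation.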


Worm towers are all around us

Popular Science

Breakthroughs, discoveries, and DIY tips sent every weekday. Biologists estimate that four out of five animals on Earth are nematodes (AKA roundworms). The tiny, wriggling, transparent invertebrates are the most abundant creatures on the planet and are found nearly everywhere, from permafrost to the deep ocean. More than one million species make up this ubiquitous group, which includes parasites, decomposers, predators, and more. "They're not about to take over the world, because they already did," Serena Ding, a biologist at the Max Planck Institute of Animal Behavior in Konstanz, Germany, tells Popular Science. "Global worming has already happened."


Flamingos conjure 'water tornadoes' to trap their prey

Popular Science

A pink flamingo is typically associated with a laid-back lifestyle, but the way these leggy birds with big personalities feed is anything but chill. When they dip their curved necks into the water, the birds use their feet, heads, and beaks to create swirling water tornadoes that efficiently group their prey together so they can slurp them up. The findings are detailed in a study published this week in the journal Proceedings of the National Academy of Sciences (PNAS). "Flamingos are actually predators, they are actively looking for animals that are moving in the water, and the problem they face is how to concentrate these animals, to pull them together and feed," Victor Ortega Jiménez, a study co-author and biologist specializing in biomechanics at the University of California, Berkeley, said in a statement.


Control of Biohybrid Actuators using NeuroEvolution

Alcaraz-Herrera, Hugo, Tsompanas, Michail-Antisthenis, Adamatzky, Andrew, Balaz, Igor

arXiv.org Artificial Intelligence

In medical-related tasks, soft robots can perform better than conventional robots because of their compliant building materials and the movements they are able to perform. However, designing soft robot controllers is not an easy task, due to the non-linear properties of their materials. Since human expertise alone is not yet sufficient to design such controllers effectively, a formal design process is needed. The present research proposes neuroevolution-based algorithms as the core mechanism to automatically generate controllers for biohybrid actuators that can be used in future medical devices, such as a drug-delivering catheter. The controllers generated by methodologies based on Neuroevolution of Augmenting Topologies (NEAT) and Hypercube-based NEAT (HyperNEAT) are compared against those generated by a standard genetic algorithm (SGA). Specifically, the metrics considered are the maximum displacement in upward bending movement and the robustness to controlling different biohybrid actuator morphologies without redesigning the control strategy. Results indicate that the neuroevolution-based algorithms produce better-suited controllers than the SGA. In particular, NEAT designed the best controllers, achieving up to 25% higher displacement when compared with SGA-produced specialised controllers trained on a single morphology and 23% when compared with general-purpose controllers trained on a set of morphologies.


TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Shangguan, Ziyao, Li, Chuhan, Ding, Yuxuan, Zheng, Yanan, Zhao, Yilun, Fitzgerald, Tesca, Cohan, Arman

arXiv.org Artificial Intelligence

Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated, as many questions can be solved using a single frame, a few frames, or frames presented out of order. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape & trend, velocity & frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.
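The abstract names a Multi-Frame Gain metric without defining it here. One plausible reading, consistent with the observation that many benchmark questions are solvable from a single frame, is the accuracy a model gains when given the full frame sequence rather than one frame. The sketch below encodes only that assumption; it is not the paper's definition.

```python
def multi_frame_gain(acc_multi_frame, acc_single_frame):
    """Hypothetical reading of Multi-Frame Gain: how much accuracy a
    model gains from seeing all frames versus a single frame. A gain
    near zero would suggest the question does not require temporal
    reasoning."""
    return acc_multi_frame - acc_single_frame

# Toy accuracies, not results from the paper.
gain = multi_frame_gain(0.62, 0.47)
```

Under this reading, a benchmark whose questions all yield large gains would be one that genuinely demands multi-frame reasoning.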


Tiny jellyfish robots made of ferrofluid can be controlled with light

New Scientist

Jellyfish-shaped robots made of magnetic ferrofluid can be controlled by light through an underwater obstacle course. Swarms of these soft robots could be useful for delivering chemicals throughout a liquid mixture or moving fluids through a lab-on-a-chip. Ferrofluid droplets are made of magnetic nanoparticles suspended in oil, and they can move across flat surfaces or change shape when coaxed in different directions by magnets. By immersing these droplets in water and exposing them to light, Mengmeng Sun at the Max Planck Institute for Intelligent Systems in Germany and his colleagues have now made them defy gravity. When ferrofluids absorb light – they are particularly good at that because they are dark – they heat up and any tiny bubbles within them expand.


Motion Generation from Fine-grained Textual Descriptions

Li, Kunhang, Feng, Yansong

arXiv.org Artificial Intelligence

The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying the movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by prompting GPT-3.5-turbo with step-by-step instructions and compulsory pseudo-code checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.


Space and Move-optimal Arbitrary Pattern Formation on a Rectangular Grid by Robot Swarms

Sharma, Avisek, Ghosh, Satakshi, Goswami, Pritam, Sau, Buddhadeb

arXiv.org Artificial Intelligence

Arbitrary pattern formation (\textsc{Apf}) is a well-studied problem in swarm robotics. To the best of our knowledge, the problem has been considered in two different settings: one in a Euclidean plane and another in an infinite grid. This work deals with the problem in an infinite rectangular grid setting. The previous works in the literature dealing with the \textsc{Apf} problem in an infinite grid had a fundamental issue: these deterministic algorithms use a lot of space in the grid to solve the problem, mainly to maintain the asymmetry of the configuration or to avoid collisions. Such solution techniques are not usable when the application field has space constraints. In this work, we consider luminous robots (with one light that can take three colors) to avoid symmetry, and we carefully design a deterministic algorithm that solves the \textsc{Apf} problem using the minimal required space in the grid. The robots are autonomous, identical, and anonymous, and they operate in Look-Compute-Move cycles under a fully asynchronous scheduler. The \textsc{Apf} algorithm proposed in \cite{BOSE2020} by Bose et al. can be modified using luminous robots so that it uses minimal space, but that algorithm is not move-optimal. The algorithm proposed in this paper not only uses minimal space but is also asymptotically move-optimal. The algorithm proposed in this work is designed for an infinite rectangular grid, but it can be easily modified to work on a finite grid as well.