Media
Texas's Water Wars
As industrial operations move to the state, residents find that their drinking water has been promised to companies. In 2019, Corpus Christi, Texas's eighth-largest city, moved forward with plans to build a desalination plant. The facility, which was expected to be completed by 2023, at a cost of a hundred and forty million dollars, would convert seawater into fresh water to be used by the area's many refineries and chemical plants. The former mayor called it "a pretty significant day in the life of our city." In anticipation of the plant's opening, the city committed to provide tens of millions of gallons of water per day to new industrial operations, including a plastics plant co-owned by ExxonMobil and the Saudi Basic Industries Corporation, a lithium refinery for Tesla batteries, and a "specialty chemicals" plant operated by Chemours.
The Guilty Pleasure of the Heist
Elaborate robberies are a Hollywood staple, and the real-life theft at the Louvre has become a phenomenon. Why are we riveted by this particular type of crime? On October 19th, a group of masked men broke into the Louvre in broad daylight and made off with some of France's crown jewels. Suspects are now in custody, but the online fervor is still going strong. On this episode of Critics at Large, Vinson Cunningham, Naomi Fry, and Alexandra Schwartz discuss the sordid satisfaction of watching a heist play out, both onscreen and off.
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Singer, Assaf, Rotstein, Noam, Mann, Amir, Kimmel, Ron, Litany, Or
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
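The region-dependent schedule described in the abstract can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the authors' implementation: it assumes each region is overwritten with a re-noised copy of the crude reference until its own release step is reached, with the motion-specified region released later (`t_strong`) than the rest (`t_weak`). The names `release_step`, `noised`, `toy_denoise_step`, `t_weak`, and `t_strong` are hypothetical, and the neighbour-averaging "denoiser" merely stands in for a real I2V diffusion step.

```python
import random


def release_step(in_motion_region, t_weak, t_strong):
    """Step index below which a pixel is no longer tied to the reference (assumed rule)."""
    return t_strong if in_motion_region else t_weak


def noised(value, t, num_steps, rng):
    """Blend a reference value with Gaussian noise; noise share grows with t (toy schedule)."""
    sigma = t / num_steps
    return (1.0 - sigma) * value + sigma * rng.gauss(0.0, 1.0)


def toy_denoise_step(x):
    """Stand-in for one reverse step of a real model: average each value with its neighbours."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def dual_clock_sample(reference, mask, num_steps=10, t_weak=8, t_strong=3, seed=0):
    """Run the toy sampler on a 1-D 'frame'; mask[i] is True inside the motion region."""
    rng = random.Random(seed)
    x = [noised(v, num_steps, num_steps, rng) for v in reference]
    for t in range(num_steps, 0, -1):
        x = toy_denoise_step(x)
        for i, in_region in enumerate(mask):
            # Pixels whose clock has not yet released them are reset to the
            # reference, re-noised to the level of the next step.
            if t - 1 >= release_step(in_region, t_weak, t_strong):
                x[i] = noised(reference[i], t - 1, num_steps, rng)
    return x


reference = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
mask = [False, False, True, True, False, False]
sample = dual_clock_sample(reference, mask)
```

In this sketch the masked pixels stay tied to the reference for five more steps than the background, which is one way to read "strong alignment in motion-specified regions while allowing flexibility elsewhere."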
LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
Sieker, Judith, Lachenmaier, Clara, Zarrieß, Sina
This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less sensitive to adopt or reject false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI's GPT-4-o, Meta's LLama-3-8B, and MistralAI's Mistral-7B-v03. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.
Steering Opinion Dynamics in Signed Time-Varying Networks via External Control Input
Priya, Swati, Tripathy, Twinkle
This paper studies targeted opinion formation in multi-agent systems evolving over signed, time-varying directed graphs. The dynamics of each agent's state follow a Laplacian-based update rule driven by both cooperative and antagonistic interactions in the presence of exogenous factors. We formulate these exogenous factors as external control inputs and establish a suitable controller design methodology enabling collective opinion to converge to any desired steady-state configuration, superseding the natural emergent clustering or polarization behavior imposed by persistently structurally balanced influential root nodes. Our approach leverages upper Dini derivative analysis and Grönwall-type inequalities to establish exponential convergence for opinion magnitude towards the desired steady-state configuration on networks with uniform quasi-strong δ-connectivity. Finally, the theoretical results are validated through extensive numerical simulations.
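A Laplacian-based update with an external input can be sketched numerically. The following is a minimal illustration, not the controller designed in the paper: it assumes the common signed Laplacian (diagonal entries are sums of absolute edge weights), a forward-Euler discretisation, and a hypothetical feedback law u = L x* - g (x - x*) chosen so that the error obeys e' = -(L + gI) e. The names `simulate`, `gain`, and `step`, and the two example graphs, are assumptions made for this example.

```python
def signed_laplacian(A):
    """Signed Laplacian: L[i][i] = sum_j |A[i][j]|, L[i][j] = -A[i][j] for i != j."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                L[i][j] = -A[i][j]
                L[i][i] += abs(A[i][j])
    return L


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def simulate(graphs, x0, target, gain=1.0, step=0.01, n_steps=2000):
    """Euler-integrate x' = -L(t) x + u over a periodically switching signed graph."""
    x = list(x0)
    for k in range(n_steps):
        L = signed_laplacian(graphs[k % len(graphs)])
        error = [xi - ti for xi, ti in zip(x, target)]
        Lx = matvec(L, x)
        Lt = matvec(L, target)
        # Hypothetical control input: cancel the drift at the target, damp the error.
        u = [lt - gain * e for lt, e in zip(Lt, error)]
        x = [xi + step * (-lx + ui) for xi, lx, ui in zip(x, Lx, u)]
    return x


# Two directed signed graphs (positive = cooperative, negative = antagonistic).
A1 = [[0, 1, -1], [1, 0, 0], [0, -1, 0]]
A2 = [[0, 0, 1], [-1, 0, 1], [0, 1, 0]]
final = simulate([A1, A2], x0=[1.0, -0.5, 0.2], target=[0.3, -0.3, 0.8])
```

With these settings each Euler step shrinks the largest error component by a factor of at most 1 - step * gain, so the state approaches the chosen target regardless of the sign pattern of the edges.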
Data Assessment for Embodied Intelligence
Xiao, Jiahao, Yan, Bowen, Zhang, Jianbo, Wang, Jia, Li, Chunyi, Cheng, Zhengxue, Zhai, Guangtao
In embodied intelligence, datasets play a pivotal role, serving as both a knowledge repository and a conduit for information transfer. The two most critical attributes of a dataset are the amount of information it provides and how easily this information can be learned by models. However, the multimodal nature of embodied data makes evaluating these properties particularly challenging. Prior work has largely focused on diversity, typically counting tasks and scenes or evaluating isolated modalities, which fails to provide a comprehensive picture of dataset diversity. On the other hand, the learnability of datasets has received little attention and is usually assessed post-hoc through model training, an expensive, time-consuming process that also lacks interpretability, offering little guidance on how to improve a dataset. In this work, we address both challenges by introducing two principled, data-driven tools. First, we construct a unified multimodal representation for each data sample and, based on it, propose diversity entropy, a continuous measure that characterizes the amount of information contained in a dataset. Second, we introduce the first interpretable, data-driven algorithm to efficiently quantify dataset learnability without training, enabling researchers to assess a dataset's learnability immediately upon its release. We validate our algorithm on both simulated and real-world embodied datasets, demonstrating that it yields faithful, actionable insights that enable researchers to jointly improve diversity and learnability. We hope this work provides a foundation for designing higher-quality datasets that advance the development of embodied intelligence.
Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation
Ji, Shulei, Wang, Zihao, Yu, Jiaxing, Yang, Xiangyuan, Li, Shuyu, Wu, Songruoyao, Zhang, Kejun
Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons.
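The two fusion operations named in the abstract can be illustrated in a few lines. This is a sketch under assumptions rather than the Diff-V2M code: `film` follows the standard definition of feature-wise linear modulation (per-channel scale and shift), while `timestep_weight` is a hypothetical sigmoid schedule, since the abstract does not say how the balance between semantic and rhythmic cues depends on the diffusion timestep.

```python
import math


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


def timestep_weight(t, num_steps, sharpness=6.0):
    """Hypothetical schedule: a weight in (0, 1) that rises with t / num_steps."""
    z = sharpness * (t / num_steps - 0.5)
    return 1.0 / (1.0 + math.exp(-z))


def weighted_fusion(semantic, rhythmic, t, num_steps):
    """Convex combination of two feature vectors with a timestep-dependent weight."""
    w = timestep_weight(t, num_steps)
    return [w * s + (1.0 - w) * r for s, r in zip(semantic, rhythmic)]


modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
fused = weighted_fusion([1.0, 0.0], [0.0, 1.0], t=500, num_steps=1000)
```

In a real model the scale and shift (and possibly the fusion weight) would be predicted by learned layers from the conditioning signal and timestep embedding; fixed values are used here only to keep the example self-contained.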
AI-generated podcasts: Synthetic Intimacy and Cultural Translation in NotebookLM's Audio Overviews
This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.
A Multi-Drone Multi-View Dataset and Deep Learning Framework for Pedestrian Detection and Tracking
Dakic, Kosta, Thilakarathna, Kanchana, Calheiros, Rodrigo N., Lim, Teng Joon
Multi-drone surveillance systems offer enhanced coverage and robustness for pedestrian tracking, yet existing approaches struggle with dynamic camera positions and complex occlusions. This paper introduces MATRIX (Multi-Aerial TRacking In compleX environments), a comprehensive dataset featuring synchronized footage from eight drones with continuously changing positions, and a novel deep learning framework for multi-view detection and tracking. Unlike existing datasets that rely on static cameras or limited drone coverage, MATRIX provides a challenging scenario with 40 pedestrians and a significant architectural obstruction in an urban environment. Our framework addresses the unique challenges of dynamic drone-based surveillance through real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird's-eye-view (BEV) representation. Experimental results demonstrate that while static camera methods maintain over 90% detection and tracking precision and accuracy metrics in a simplified MATRIX environment without an obstruction, 10 pedestrians and a much smaller observational area, their performance significantly degrades in the complex environment. Our proposed approach maintains robust performance with ~90% detection and tracking accuracy, as well as successfully tracks ~80% of trajectories under challenging conditions. Transfer learning experiments reveal strong generalization capabilities, with the pretrained model achieving much higher detection and tracking accuracy performance compared to training the model from scratch. Additionally, systematic camera dropout experiments reveal graceful performance degradation, demonstrating practical robustness for real-world deployments where camera failures may occur. The MATRIX dataset and framework provide essential benchmarks for advancing dynamic multi-view surveillance systems.