Media
We owe the Trump admin a debt of gratitude for the Signal group chat leak
Sometimes journalists befuddle me, and I'm a journalist – although my touchy detractors would dispute that. Perhaps like you, I have been watching – with a healthy dose of bemusement and amusement – the outrage-du-jour dominate the latest 24-hour "news cycle" in North America and beyond. Such is the squirrel-like attention span of many of my perpetually outraged colleagues, that today's outrage usually has a short life expectancy since another outrage inevitably comes along tomorrow. But the outrage seizing Washington, DC – the capital of outrage – appears poised to consume the Beltway press corps for more than a day or two. When that happens, the outrage tends to evolve into a four-alarm scandal which journalists crave because it often translates into a big, ego-boosting award for the lucky scribe who triggered the original outrage.
Emotion Detection and Music Recommendation System
Kambham, Swetha, Jhonson, Hubert, Kambham, Sai Prathap Reddy
As artificial intelligence becomes more and more ingrained in daily life, we present a novel system that uses deep learning for music recommendation and emotion-based detection. Through the use of facial recognition and the DeepFace framework, our method analyses human emotions in real-time and then plays music that reflects the mood it has discovered. The system uses a webcam to take pictures, analyses the most common facial expression, and then pulls a playlist from local storage that corresponds to the mood it has detected. An engaging and customised experience is ensured by allowing users to manually change the song selection via a dropdown menu or navigation buttons. By continuously looping over the playlist, the technology guarantees continuity. The objective of our system is to improve emotional well-being through music therapy by offering a responsive and automated music-selection experience.
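The selection step described above (take the most frequent expression across the captured frames, then loop over a matching playlist) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the emotion labels, the `PLAYLISTS` mapping, the file names, and both function names are hypothetical, and the per-frame labels are assumed to come from an upstream facial-expression classifier.

```python
from collections import Counter
from itertools import cycle, islice

# Hypothetical mood-to-playlist mapping; the file names are placeholders.
PLAYLISTS = {
    "happy": ["upbeat_1.mp3", "upbeat_2.mp3"],
    "sad": ["calm_1.mp3", "calm_2.mp3", "calm_3.mp3"],
    "neutral": ["ambient_1.mp3"],
}


def dominant_emotion(frame_labels):
    """Return the most frequent label among the per-frame predictions."""
    if not frame_labels:
        raise ValueError("no frames were analysed")
    return Counter(frame_labels).most_common(1)[0][0]


def playlist_for(emotion, default="neutral"):
    """Return an endless iterator over the playlist for the given mood.

    Unknown moods fall back to the default playlist (an assumption).
    """
    tracks = PLAYLISTS.get(emotion, PLAYLISTS[default])
    return cycle(tracks)


# Assumed classifier output for five webcam frames.
labels = ["happy", "neutral", "happy", "sad", "happy"]
mood = dominant_emotion(labels)
queue = playlist_for(mood)
first_three = list(islice(queue, 3))  # the playlist wraps around
print(mood, first_three)
```

Using `itertools.cycle` gives the continuous looping behaviour without any index bookkeeping; a manual override such as the dropdown would simply call `playlist_for` again with the user's choice.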
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Lin, Yan-Bo, Lin, Kevin, Yang, Zhengyuan, Li, Linjie, Wang, Jianfeng, Lin, Chung-Ching, Wang, Xiaofei, Bertasius, Gedas, Wang, Lijuan
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset, validating its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Da, Jeff, Forbes, Maxwell, Zellers, Rowan, Zheng, Anthony, Hwang, Jena D., Bosselut, Antoine, Choi, Yejin
Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.
Collaborative Evolution: Multi-Round Learning Between Large and Small Language Models for Emergent Fake News Detection
Zhou, Ziyi, Zhang, Xiaoming, Tan, Shenghan, Zhang, Litian, Li, Chaozhuo
The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zero-shot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework, Multi-Round Collaboration Detection (MRCD), is proposed to address these limitations. The MRCD framework combines the strengths of both LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. Our framework MRCD achieves SOTA results on two real-world datasets, Pheme and Twitter16, with accuracy improvements of 7.4% and 12.8% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news.
A multi-agentic framework for real-time, autonomous freeform metasurface design
Lupoiu, Robert, Shao, Yixuan, Dai, Tianxiang, Mao, Chenkai, Edee, Kofi, Fan, Jonathan A.
Innovation in nanophotonics currently relies on human experts who synergize specialized knowledge in photonics and coding with simulation and optimization algorithms, entailing design cycles that are time-consuming, computationally demanding, and frequently suboptimal. We introduce MetaChat, a multi-agentic design framework that can translate semantically described photonic design goals into high-performance, freeform device layouts in an automated, nearly real-time manner. Multi-step reasoning is enabled by our Agentic Iterative Monologue (AIM) paradigm, which coherently interfaces agents with code-based tools, other specialized agents, and human designers. Design acceleration is facilitated by Feature-wise Linear Modulation-conditioned Maxwell surrogate solvers that support the generalized evaluation of metasurface structures. We use freeform dielectric metasurfaces as a model system and demonstrate with MetaChat the design of multi-objective, multi-wavelength metasurfaces orders of magnitude faster than conventional methods. These concepts present a scientific computing blueprint for utilizing specialist design agents, surrogate solvers, and human interactions to drive multi-physics innovation and discovery.
2 drones for the price of 1? Someone at the drone factory is getting fired.
Ever tried flying two drones at once? Probably not--because who can afford one of these toys nowadays, let alone double the money? Yeah, it seems someone at the drone factory made a huge mistake. They bumped a lever or something, and now we have so many extra drones we're offering this buy one get one free drone deal (or maybe it's just part of our spring sale). Through March 30, you can pay 99.97 for the Ninja Dragon Phantom Eagle PRO and get the Blade K free--perfect for sharing or gifting. The Dragon Phantom Eagle PRO drone is the more advanced of the bunch and likely the one you'll want to keep for yourself.
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Alzubi, Salaheddin, Brooks, Creston, Chiniya, Purva, Contente, Edoardo, von Gerlach, Chiara, Irwin, Lucas, Jiang, Yihan, Kaz, Arda, Nguyen, Windsor, Oh, Sewoong, Tyagi, Himanshu, Viswanath, Pramod
We introduce Open Deep Search (ODS) to close the increasing gap between the proprietary search AI solutions, such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches and sometimes surpasses the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves the best existing baseline of the recently released GPT-4o Search Preview by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLMs -- for example, DeepSeek-R1 that achieves 82.4% on SimpleQA and 30.1% on FRAMES -- with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.
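As a quick check on the figures quoted above, the gain from adding ODS to DeepSeek-R1 can be computed directly. This is only a small arithmetic sketch: the four accuracy values are taken from the abstract, while the dictionary and function names are illustrative.

```python
# Accuracy figures (percent) quoted in the abstract.
base = {"SimpleQA": 82.4, "FRAMES": 30.1}      # DeepSeek-R1 alone
with_ods = {"SimpleQA": 88.3, "FRAMES": 75.3}  # DeepSeek-R1 with ODS


def gains(before, after):
    """Return the gain in percentage points for each benchmark."""
    return {name: round(after[name] - before[name], 1) for name in before}


improvement = gains(base, with_ods)
print(improvement)
```

The gain is a few points on SimpleQA and far larger on FRAMES.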
Aether: Geometric-Aware Unified World Modeling
Aether Team, Zhu, Haoyi, Wang, Yifan, Zhou, Jianjun, Chang, Wenzheng, Zhou, Yang, Li, Zizun, Chen, Junyi, Shen, Chunhua, Pang, Jiangmiao, He, Tong
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
Synthetic Video Enhances Physical Fidelity in Video Synthesis
Zhao, Qi, Ni, Xingyu, Wang, Ziyu, Cheng, Feng, Yang, Ziyan, Jiang, Lu, Wang, Bohan
We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis.