Media
Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning
This thesis introduces "Embodied Spatial Intelligence" to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs) and physical embodiment, we present contributions on two fronts: scene representation and spatial reasoning. For perception, we develop robust, scalable, and accurate scene representations using implicit neural models, with contributions in self-supervised camera calibration, high-fidelity depth field generation, and large-scale reconstruction. For spatial reasoning, we enhance the spatial capabilities of LLMs by introducing a novel navigation benchmark, a method for grounding language in 3D, and a state-feedback mechanism to improve long-horizon decision-making. This work lays a foundation for robots that can robustly perceive their surroundings and intelligently act upon complex, language-based commands.
NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks
Chien, Yen-Che, Wang, Kuang-Da, Wang, Wei-Yao, Peng, Wen-Chih
Recent advances in autonomous digital agents from industry (e.g., Manus AI and Gemini's research mode) highlight their potential for handling structured tasks through autonomous decision-making and task decomposition; however, it remains unclear to what extent agent-based systems can improve productivity on multimodal web data. We study this in the realm of journalism, which requires iterative planning, interpretation, and contextual reasoning over raw multimodal content to form a well-structured news article. We introduce NEWSAGENT, a benchmark for evaluating how agents can automatically search available raw content, select the desired information, and edit and rephrase it into a news article by accessing core journalistic functions. Given a writing instruction and firsthand data, mirroring how a journalist begins a news draft, agents are tasked to identify narrative perspectives, issue keyword-based queries, retrieve historical background, and generate complete articles. Unlike typical summarization or retrieval tasks, essential context is not directly available and must be actively discovered, reflecting the information gaps faced in real-world news writing. NEWSAGENT includes 6k human-verified examples derived from real news, with multimodal content converted to text for broad model compatibility. We evaluate open- and closed-source LLMs with commonly used agentic frameworks on NEWSAGENT and find that agents can retrieve relevant facts but struggle with planning and narrative integration. We believe that NEWSAGENT serves as a realistic testbed for iterating on and evaluating agent capabilities in translating multimodal web data manipulation into real-world productivity.
SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra
Lee, Changjae, Zhao, Zhuoyue, Xiong, Jinjun
The emergence of large language models (LLMs) has enabled a new class of semantic data processing systems (SDPSs) that support declarative queries against unstructured documents. Existing SDPSs, however, lack a unified algebraic foundation, making their queries difficult to compose, reason about, and optimize. We propose a new semantic algebra, SABER (Semantic Algebra Based on Extended Relational algebra), which opens the possibility of logical plan construction, optimization, and formal correctness guarantees for semantic operations. We further propose to implement SABER in a SQL-compatible syntax so that it natively supports mixed structured/unstructured data processing. With SABER, we showcase the feasibility of providing a unified interface for existing SDPSs so that it can effectively mix and match any semantically compatible operator implementation from any SDPS, greatly enhancing SABER's applicability for community contributions.
CoComposer: LLM Multi-agent Collaborative Music Composition
Xing, Peiwen, Plaat, Aske, van Stein, Niki
Existing AI music composition tools are limited in generation duration, musical quality, and controllability. We introduce CoComposer, a multi-agent system that consists of five collaborating agents, each with a task based on the traditional music composition workflow. Using the AudioBox-Aesthetics system, we experimentally evaluate CoComposer on four compositional criteria. We test with three LLMs (GPT-4o, DeepSeek-V3-0324, Gemini-2.5-Flash) and find that CoComposer (1) outperforms existing multi-agent LLM-based systems in music quality and (2) outperforms a single-agent system in production complexity. Compared to the non-LLM MusicLM, CoComposer has better interpretability and editability, although MusicLM still produces better music.
Private, Verifiable, and Auditable AI Systems
The growing societal reliance on artificial intelligence necessitates robust frameworks for ensuring its security, accountability, and trustworthiness. This thesis addresses the complex interplay between privacy, verifiability, and auditability in modern AI, particularly in foundation models. It argues that technical solutions that integrate these elements are critical for responsible AI innovation. Drawing from international policy contributions and technical research to identify key risks in the AI pipeline, this work introduces novel technical solutions for critical privacy and verifiability challenges. Specifically, the research introduces techniques for enabling verifiable and auditable claims about AI systems using zero-knowledge cryptography; utilizing secure multi-party computation and trusted execution environments for auditable, confidential deployment of large language models and information retrieval; and implementing enhanced delegation mechanisms, credentialing systems, and access controls to secure interactions with autonomous and multi-agent AI systems. Synthesizing these technical advancements, this dissertation presents a cohesive perspective on balancing privacy, verifiability, and auditability in foundation model-based AI systems, offering practical blueprints for system designers and informing policy discussions on AI safety and governance.
From Sound to Sight: Towards AI-authored Music Videos
Vitasovic, Leo, Graßhof, Stella, Kloft, Agnes Mercedes, Lehtola, Ville V., Cunneen, Martin, Starostka, Justyna, McGarry, Glenn, Li, Kun, Brandt, Sami S.
Conventional music visualisation systems rely on handcrafted ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified vocal or instrumental song using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we examine how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. Next, we employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects, then design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherence, and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.
Towards Compute-Optimal Many-Shot In-Context Learning
Golchin, Shahriar, Chen, Yanfei, Han, Rujun, Gandhi, Manan, Yu, Tianli, Mishra, Swaroop, Surdeanu, Mihai, Agarwal, Rishabh, Lee, Chen-Yu, Pfister, Tomas
Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
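The two selection strategies described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of cosine similarity, the plain k-means routine, the choice to pick demonstrations nearest each centroid, and the ordering (cached block first so its computation can be reused) are all assumptions, since the abstract does not specify these details.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_similar(query, demo_vecs, k, exclude=()):
    """Indices of the k demonstrations most similar to `query` (assumed metric: cosine)."""
    excluded = set(exclude)
    candidates = [i for i in range(len(demo_vecs)) if i not in excluded]
    candidates.sort(key=lambda i: cosine(query, demo_vecs[i]), reverse=True)
    return candidates[:k]


def kmeans_centroids(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means over `points`; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def cached_random(demo_vecs, n_cached, seed=0):
    """Strategy 1 shared block: a fixed random set reused for every test sample."""
    return random.Random(seed).sample(range(len(demo_vecs)), n_cached)


def cached_centroid(demo_vecs, test_vecs, n_clusters, per_centroid, seed=0):
    """Strategy 2 shared block: demos nearest to k-means centroids of the test samples."""
    chosen = []
    for centroid in kmeans_centroids(test_vecs, n_clusters, seed=seed):
        chosen += top_similar(centroid, demo_vecs, per_centroid, exclude=chosen)
    return chosen


def build_prompt_indices(test_vec, demo_vecs, cached, n_similar):
    """Cached block first (assumed, for reuse), then a few per-sample similar demos."""
    similar = top_similar(test_vec, demo_vecs, n_similar, exclude=cached)
    return list(cached) + similar
```

In use, the shared block would be computed once with `cached_random` or `cached_centroid`, and `build_prompt_indices` would then be called per test sample with a small `n_similar`, so that most of the prompt is identical across samples.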
Google sets date for Nest Cam, Gemini for Home reveal
After years of waiting for new Nest smart gear from Google, it appears we'll actually get our wish next month. "Gemini is coming to Google Home," a post reads on Google's official X account. "Come back October 1st," the message continues above an animation of a Nest camera peeking into the frame. A link labeled "sign up for updates" sends users to the Google Store, where they can sign up for news. It's not clear what type of reveal Google is planning for October 1; it could simply be a news release, or perhaps it will be a full-on media event.
Onion CEO Ben Collins Hasn't Given Up on Print--or Buying Infowars
A year after relaunching The Onion as a newspaper, Collins visits to talk about why "going into something and not ruining it is bravery." Ben Collins made a big bet. A year ago, just a few months after he'd been named CEO of The Onion, he relaunched its print edition. Once a favorite on university campuses, The Onion hadn't published a physical issue since 2013. Common wisdom said that readership, and advertising dollars, just weren't there for newspapers. But Collins, a fan of the satirical paper since childhood, thought "that's dumb." Readers celebrated The Onion's relaunch and the ability to read all of its bitingly funny headlines on a single broadsheet. Collins wouldn't give exact numbers on how many people are currently subscribed to the print edition but did say they should be enough to keep its writers' room humming (a few weeks after we taped this episode, the Wall Street Journal reported that The Onion now boasts more than 53,000 paying subscribers). On this episode, I spoke with Collins about his hopes for The Onion, the future of journalism, and his Balatro addiction. KATIE DRUMMOND: Do you have a recent favorite Onion headline? Can I look it up for you? "Ghislaine Maxwell Can't Help but Notice Interview Room Covered in Plastic Sheeting." The staff churns out like 15 a day that are great. I sit there, and I still don't know how they do it. When I say they throw away eight or nine of the best sentences I would ever write every day, I mean that sincerely.
Clanker! This slur against robots is all over the internet – but is it offensive?
It sounds a bit insulting. It is, in fact, a slur. While it's sometimes used to denigrate actual robots – including delivery bots and self-driving cars – it's increasingly used to insult AI chatbots and platforms such as ChatGPT. I'm new to this – why would I want to insult AI? Does the AI care that you're insulting it? That's a complex and hotly debated philosophical question, to which the answer is "no".