"Final Boy," by Sam Lipsyte

The New Yorker

Thing is, I've been trying to find a moment to write down what happened to Bennett and me for a while now, but the demands of my audience rarely abate. I've hardly time to jot down a grocery list, let alone compose a personal chronicle. Bennett says I'm practically the Charles (as in Dickens) of scribblers devoted to mining the rich vein of a certain underappreciated sitcom of the nineteen-eighties, but I will leave that for history to judge. Besides, what does Bennett know? Just before he got that way, I was in Amok Mocha, where I like to sip cold brew and do my "C: FB" conjuring, and I struck up a conversation with a young woman who confessed to being a creative-writing student. She told me that in her workshop they talk about the "occasion" of the story. Why is the narrator telling this tale now? What pressures or conditions have coalesced to move a person to speak? I feigned ignorance of the concept, though I'd heard it often in my own writing classes long ago. Instead, I told her that, if the installment I was presently crafting flowed from any occasion, it was this: Charles is anxious about the imminent disintegration of the universe via the ever-increasing tug of dark matter. Moreover, he's ticked off that his best buddy, Buddy, doesn't seem perturbed by the prospect. "How imminent?" the woman said, and sipped her Balkan, a new offering at Amok. When I informed her that he was the titular hero of "Charles in Charge," the most criminally uncelebrated television program of the Reagan era, the woman pursed her lips. "We all write fan fiction," I told her. "Some of us are just more honest about it." The young woman gathered up her belongings, moved to another table. Did she think I was being facetious? Still, if there is an occasion for the story I'm relating now, it's a bit nearer on the space-time continuum. 
My best buddy, Bennett, is in a vegetative state induced by an anoxic brain injury, and, if he doesn't wake up soon and vouch for me, I could be kicked out of our apartment.



Doge wants to replace our institutions with a tech utopia. It won't work

Mike Pepi

The Guardian

Elon Musk has stepped away from Doge with very little "efficiency" to show for it. While it may have been more showpiece than real policy, this short, brutal experiment in Silicon Valley governance reveals a long-simmering battle between digital utopians and the institutional infrastructures critical to functioning democracies. Doge's website dubiously claims $190bn in savings. The receipts show that the cuts are less about efficiency than about effective dissolution, the fate met by USAID, the federal agency responsible for distributing foreign assistance. These brash new reductions are not just your garden-variety small-government crusades or culture-war skirmishes.


The Limits of A.I.-Generated Miyazaki

The New Yorker

If asked to come up with a quintessentially "human" work of art, one could do worse than to name a film by Studio Ghibli. The Japanese animation studio, founded by the legendary eighty-four-year-old director Hayao Miyazaki, is known for its hand-drawn imagery, lushly organic color palettes, epic narratives, and evocation of both the emotional ambiguities of childhood and the twisting path to becoming an adult. We American millennials were blessed to have the films translated and distributed in English just as we were growing up, and so movies including "My Neighbor Totoro," "Princess Mononoke," and "Spirited Away" are nigh-universally recognizable touchstones of our youth. Any Ghibli imagery is primed to make us feel a combination of pleasurable nostalgia and mournful shivers, evoking the doomed forest creatures, greedy bathhouse ghosts, and missed connections featured in Miyazaki's cinematic story lines. Unfortunately, that sense of poignancy quickly erodes when you are bombarded with thousands of Ghibli-esque copycat images, as we all were online last week, thanks to OpenAI's latest version of its ChatGPT tool.


WikiVideo: Article Generation from Multiple Videos

Martin, Alexander, Kriz, Reno, Walden, William Gantt, Sanders, Kate, Recknor, Hannah, Yang, Eugene, Ferraro, Francis, Van Durme, Benjamin

arXiv.org Artificial Intelligence

We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
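The iterative interplay the abstract describes, a reasoning model querying a VideoLLM and then drafting, can be sketched in a few lines. This is a toy illustration only: the function names, the stop criterion, and the stand-in models are invented here, not the paper's actual prompts or architecture.

```python
# Illustrative sketch of a CAG-style loop: a reasoning model asks questions,
# a VideoLLM answers per video, and the reasoner eventually drafts an article.

def video_llm_answer(question, video):
    """Stand-in for a VideoLLM: returns a low-level observation."""
    return f"In {video}, observed: {question.lower()}"

def reasoning_model(event, transcript):
    """Stand-in for an r1-style reasoner: asks follow-ups, then drafts."""
    if len(transcript) < 2:
        return ("ask", f"What happens during {event}? (round {len(transcript) + 1})")
    return ("draft", f"Article on {event}: " + " ".join(a for _, a in transcript))

def collaborative_article(event, videos, max_rounds=5):
    transcript = []  # (question, answer) pairs accumulated across rounds
    for _ in range(max_rounds):
        action, payload = reasoning_model(event, transcript)
        if action == "draft":
            return payload
        # Each question from the reasoner is posed to the VideoLLM per video.
        answers = [video_llm_answer(payload, v) for v in videos]
        transcript.append((payload, " | ".join(answers)))
    return None
```

The point of the structure is that the reasoner never touches pixels directly; it only sees the VideoLLM's answers, which is what lets it draw higher-level inferences than the VideoLLM alone.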


I Bet You Did Not Mean That: Testing Semantic Importance via Betting

Teneggi, Jacopo, Sulam, Jeremias

arXiv.org Machine Learning

Recent works have extended notions of feature importance to \emph{semantic concepts} that are inherently interpretable to the users interacting with a black-box predictive model. Yet, precise statistical guarantees, such as false positive rate control, are needed to communicate findings transparently and to avoid unintended consequences in real-world scenarios. In this paper, we formalize the global (i.e., over a population) and local (i.e., for a sample) statistical importance of semantic concepts for the predictions of opaque models, by means of conditional independence, which allows for rigorous testing. We use recent ideas of sequential kernelized testing (SKIT) to induce a ranking of importance across concepts, and showcase the effectiveness and flexibility of our framework on synthetic datasets as well as on image classification tasks using vision-language models such as CLIP.
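The "testing via betting" idea underlying SKIT can be sketched with a wealth process: a bettor stakes a fraction of their wealth on each observation, wealth grows only if the concept actually carries information, and crossing 1/alpha yields a valid sequential rejection. The payoff stream below is a toy stand-in, not the paper's kernelized statistic.

```python
# Minimal betting test: under the null the payoffs have non-positive mean,
# so wealth is a supermartingale and Ville's inequality bounds false alarms.

def sequential_betting_test(payoffs, alpha=0.05, bet_fraction=0.5):
    """payoffs: stream of values in [-1, 1] with E[payoff] <= 0 under the null."""
    wealth = 1.0
    for t, g in enumerate(payoffs, start=1):
        wealth *= 1.0 + bet_fraction * g   # stake a fixed fraction each round
        if wealth >= 1.0 / alpha:          # sequential rejection threshold
            return ("reject", t, wealth)
        if wealth <= 0.0:
            break                          # bankrupt: can never reject
    return ("fail to reject", None, wealth)
```

Because the test is sequential, it can stop as soon as the evidence is decisive, which is what induces the ranking across concepts: more important concepts are rejected earlier.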


HAMMR: HierArchical MultiModal React agents for generic VQA

Castrejon, Lluis, Mensink, Thomas, Zhou, Howard, Ferrari, Vittorio, Araujo, Andre, Uijlings, Jasper

arXiv.org Artificial Intelligence

Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical Multi-Modal React. We start from a multimodal ReAct-based [55] system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model [10] by 5.0%.
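The hierarchical pattern HAMMR describes, a top-level agent that routes a question to a specialized sub-agent rather than juggling every tool itself, can be caricatured in a few lines. The agent names and routing rules below are invented for illustration; a real ReAct agent would reason in natural language rather than keyword-match.

```python
# Toy sketch of hierarchical agent dispatch: the top-level agent delegates
# to specialists (counting, OCR) instead of holding the combined tool set.

SPECIALISTS = {
    "count": lambda q: "3 objects",          # stand-in for a counting agent
    "read":  lambda q: "text says 'EXIT'",   # stand-in for an OCR agent
}

def top_level_agent(question):
    # Routing here is a keyword match purely for illustration.
    if "how many" in question.lower():
        return SPECIALISTS["count"](question)
    if "say" in question.lower() or "text" in question.lower():
        return SPECIALISTS["read"](question)
    return "no specialist available"
```

The design point is compositionality: each specialist sees only its own tools, so adding a new VQA skill does not degrade the others, which is what the flat LLM+tools baseline gets wrong.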


Exploiting Data Hierarchy as a New Modality for Contrastive Learning

Bhalla, Arjun, Levenson, Daniel, Bernhard, Jan, Abilov, Anton

arXiv.org Artificial Intelligence

This work investigates how hierarchically structured data can help neural networks learn conceptual representations of cathedrals. The underlying WikiScenes dataset provides a spatially organized hierarchical structure of cathedral components. We propose a novel hierarchical contrastive training approach that leverages a triplet margin loss to represent the data's spatial hierarchy in the encoder's latent space. As such, the proposed approach investigates if the dataset structure provides valuable information for self-supervised learning. We apply t-SNE to visualize the resultant latent space and evaluate the proposed approach by comparing it with other dataset-specific contrastive learning methods using a common downstream classification task. The proposed method outperforms the comparable weakly-supervised and baseline methods. Our findings suggest that dataset structure is a valuable modality for weakly-supervised learning.
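The triplet margin loss the abstract mentions can be sketched directly, with the WikiScenes hierarchy supplying positives (an image from the same cathedral component as the anchor) and negatives (an image from a different branch). This is a pure-Python stand-in for the usual deep-learning framework op, shown on raw vectors rather than encoder outputs.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a,p) - d(a,n) + margin): pulls hierarchy siblings together
    and pushes embeddings from other branches at least `margin` apart."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

When the negative is already farther from the anchor than the positive by more than the margin, the loss is zero and the encoder receives no gradient from that triplet.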


Learning Structure-from-Motion with Graph Attention Networks

Brynte, Lucas, Iglesias, José Pedro, Olsson, Carl, Kahl, Fredrik

arXiv.org Artificial Intelligence

In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. In order to obtain a good enough initialization for BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging, or triangulation) which provides an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views, and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods, and challenges COLMAP while having lower runtime.
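The reprojection error that Bundle Adjustment minimizes can be written out concretely. The sketch below assumes a pinhole camera with identity rotation and unit focal length for simplicity; real SfM pipelines such as COLMAP optimize full 6-DoF poses with sparse nonlinear least-squares solvers.

```python
# Reprojection error for a simplified pinhole camera (translation only).

def project(point3d, camera_t, focal=1.0):
    """Project a 3D point into a camera translated by camera_t (no rotation)."""
    x, y, z = (p - t for p, t in zip(point3d, camera_t))
    return (focal * x / z, focal * y / z)

def reprojection_error(point3d, camera_t, observed_2d, focal=1.0):
    """Pixel distance between the projected point and its observed keypoint."""
    u, v = project(point3d, camera_t, focal)
    du, dv = u - observed_2d[0], v - observed_2d[1]
    return (du * du + dv * dv) ** 0.5
```

BA sums this error over every (camera, point) observation and adjusts poses and 3D coordinates jointly, which is why it needs the good initialization that the paper's graph attention model learns to predict in one shot.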


DeepMind's CEO Helped Take AI Mainstream. Now He's Urging Caution

TIME - Tech

Demis Hassabis stands halfway up a spiral staircase, surveying the cathedral he built. The DNA sculpture, spanning three floors, is the centerpiece of DeepMind's recently opened London headquarters. It's an artistic representation of the code embedded in the nucleus of nearly every cell in the human body. "Although we work on making machines smart, we wanted to keep humanity at the center of what we're doing here," Hassabis, DeepMind's CEO and co-founder, tells TIME. This building, he says, is a "cathedral to knowledge." Each meeting room is named after a famous scientist or philosopher; we meet in the one dedicated to James Clerk Maxwell, the man who first theorized electromagnetic radiation. "I've always thought of DeepMind as an ode to intelligence," Hassabis says. Hassabis, 46, has always been obsessed with intelligence: what it is, the possibilities it unlocks, and how to acquire more of it.