Kaleidoscopic Teaming in Multi Agent Simulations

arXiv.org Artificial Intelligence

Warning: This paper contains content that may be inappropriate or offensive. AI agents have gained significant recent attention due to their autonomous tool usage capabilities and their integration in various real-world applications. This autonomy poses novel challenges for the safety of such systems, both in single- and multi-agent scenarios. We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in complex behaviors, thought processes and actions taken by agents. Moreover, they fail to consider risks in multi-agent setups where various vulnerabilities can be exposed when agents engage in complex behaviors and interactions with each other. To address this shortcoming, we introduce the term kaleidoscopic teaming, which seeks to capture the complex and wide range of vulnerabilities that can arise in agents in both single-agent and multi-agent scenarios. We also present a new kaleidoscopic teaming framework that generates a diverse array of scenarios modeling real-world human societies. Our framework evaluates the safety of agents in both single-agent and multi-agent setups. In the single-agent setup, an agent is given a scenario that it needs to complete using the tools it has access to. In the multi-agent setup, multiple agents either compete or cooperate to complete a task in the scenario, through which we capture existing safety vulnerabilities in agents. We introduce new in-context optimization techniques that can be used in our kaleidoscopic teaming framework to generate better scenarios for safety analysis. Lastly, we present appropriate metrics that can be used along with our framework to measure the safety of agents. Utilizing our kaleidoscopic teaming framework, we identify vulnerabilities in various models with respect to their safety in agentic use-cases.


Episode-specific Fine-tuning for Metric-based Few-shot Learners with Optimization-based Training

arXiv.org Artificial Intelligence

In few-shot classification tasks (so-called episodes), a small set of labeled support samples is provided during inference to aid the classification of unlabeled query samples. Metric-based models typically operate by computing similarities between query and support embeddings within a learned metric space, followed by nearest-neighbor classification. However, these labeled support samples are often underutilized--they are only used for similarity comparison, despite their potential to fine-tune and adapt the metric space itself to the classes in the current episode. To address this, we propose a series of simple yet effective episode-specific, during-inference fine-tuning methods for metric-based models, including Rotational Division Fine-Tuning (RDFT) and its two variants, Iterative Division Fine-Tuning (IDFT) and Augmented Division Fine-Tuning (ADFT). These methods construct pseudo support-query pairs from the given support set to enable fine-tuning even for non-parametric models. Nevertheless, the severely limited amount of data in each task poses a substantial risk of overfitting when applying such fine-tuning strategies. To mitigate this, we further propose to train the metric-based model within an optimization-based meta-learning framework. With the combined efforts of episode-specific fine-tuning and optimization-based meta-training, metric-based models are equipped with the ability to rapidly adapt to the limited support samples during inference while avoiding overfitting. We validate our approach on three audio datasets from diverse domains, namely ESC-50 (environmental sounds), Speech Commands V2 (spoken keywords), and Medley-solos-DB (musical instrument). Experimental results demonstrate that our approach consistently improves performance for all evaluated metric-based models (especially for attention-based models) and generalizes well across different audio domains.
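
The metric-based pipeline described in this abstract can be illustrated with a small sketch. The code below is not the paper's implementation: `classify` shows plain nearest-neighbor classification over support embeddings using cosine similarity, and `rotational_splits` is one assumed reading of how pseudo support-query pairs might be carved out of a support set (in each round, one sample per class is held out as a pseudo-query). All function names and the splitting rule are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query, support):
    """Return the label of the support embedding most similar to the query.

    support is a list of (embedding, label) pairs.
    """
    return max(support, key=lambda pair: cosine(query, pair[0]))[1]


def rotational_splits(support):
    """Yield (pseudo_support, pseudo_query) pairs built from the support set.

    Assumed scheme (hypothetical): in round r, the r-th sample of every
    class is held out as a pseudo-query and the rest stay as support.
    """
    by_class = defaultdict(list)
    for embedding, label in support:
        by_class[label].append((embedding, label))
    shots = min(len(items) for items in by_class.values())
    for r in range(shots):
        pseudo_query = [items[r] for items in by_class.values()]
        pseudo_support = [
            item
            for items in by_class.values()
            for i, item in enumerate(items)
            if i != r
        ]
        yield pseudo_support, pseudo_query


support_set = [
    ([1.0, 0.0], "a"),
    ([0.9, 0.1], "a"),
    ([0.0, 1.0], "b"),
    ([0.1, 0.9], "b"),
]
predicted = classify([0.8, 0.2], support_set)
splits = list(rotational_splits(support_set))
```

Each pseudo pair could then drive a fine-tuning step on the embedding model. That step is omitted here because the abstract does not specify it.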


Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown significant potential to change how we write, communicate, and create, leading to rapid adoption across society. This dissertation examines how individuals and institutions are adapting to and engaging with this emerging technology through three research directions. First, I demonstrate how the institutional adoption of AI detectors introduces systematic biases, particularly disadvantaging writers of non-dominant language varieties, highlighting critical equity concerns in AI governance. Second, I present novel population-level algorithmic approaches that measure the increasing adoption of LLMs across writing domains, revealing consistent patterns of AI-assisted content in academic peer reviews, scientific publications, consumer complaints, corporate communications, job postings, and international organization press releases. Finally, I investigate LLMs' capability to provide feedback on research manuscripts through a large-scale empirical analysis, offering insights into their potential to support researchers who face barriers in accessing timely manuscript feedback, particularly early-career researchers and those from under-resourced settings.


CORONA: A Coarse-to-Fine Framework for Graph-based Recommendation with Large Language Models

arXiv.org Artificial Intelligence

Recommender systems (RSs) are designed to retrieve candidate items a user might be interested in from a large pool. A common approach is using graph neural networks (GNNs) to capture high-order interaction relationships. As large language models (LLMs) have shown strong capabilities across domains, researchers are exploring their use to enhance recommendation. However, prior work limits LLMs to re-ranking results or dataset augmentation, failing to utilize their power during candidate filtering, which may lead to suboptimal performance. Instead, we propose to leverage LLMs' reasoning abilities during the candidate filtering process, and introduce Chain Of Retrieval ON grAphs (CORONA) to progressively narrow down the range of candidate items on interaction graphs with the help of LLMs: (1) First, the LLM performs preference reasoning based on user profiles, with the response serving as a query to extract relevant users and items from the interaction graph as preference-assisted retrieval; (2) Then, using the information retrieved in the previous step along with the purchase history of the target user, the LLM conducts intent reasoning to help refine an even smaller interaction subgraph as intent-assisted retrieval; (3) Finally, we employ a GNN to capture high-order collaborative filtering information from the extracted subgraph, performing GNN-enhanced retrieval to generate the final recommendation results. The proposed framework leverages the reasoning capabilities of LLMs during the retrieval process, while seamlessly integrating GNNs to enhance overall recommendation performance. Extensive experiments on various datasets and settings demonstrate that our proposed CORONA achieves state-of-the-art performance with an 18.6% relative improvement in recall and an 18.4% relative improvement in NDCG on average.
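
For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute recall@k, NDCG@k with binary relevance, and a relative improvement. The abstract does not state the exact metric definitions or cutoffs used, so these formulas and the sample data are assumptions for illustration only.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def relative_improvement(new, baseline):
    """Relative gain of a new score over a baseline, e.g. 0.186 for 18.6%."""
    return (new - baseline) / baseline


# Illustrative data only; not taken from the paper.
ranking = ["a", "b", "c", "d"]
relevant_items = {"a", "c"}
recall = recall_at_k(ranking, relevant_items, 2)
ndcg = ndcg_at_k(ranking, relevant_items, 3)
```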


Millions Use It Every Day. It's One of the Internet's Most Important Websites. Bots Are Destroying It, Piece by Piece.

Slate

In the years since ChatGPT's debut transformed Silicon Valley into an artificial intelligence hype factory, the internet's most vibrant communities have puzzled over how to adapt to the ensuing deluge of A.I. slop, especially as autogenerated outputs become more sophisticated. Perhaps no platform exemplifies this conundrum better than Reddit, the anonymized message-board network that's been connecting millions of humans across the world for 20 years now--as many users there increasingly wonder whether they are, indeed, still connecting with other humans. Such concerns aren't new, but they've been heightened thanks to a shocking exercise of A.I.-powered manipulation. In late April, the moderation team for the popular subreddit r/ChangeMyView disclosed that researchers from the University of Zurich had conducted an "unauthorized experiment" on community members that "deployed AI-generated comments to study how AI could be used to change views."


John Oliver on AI slop: 'Some of this stuff is potentially very dangerous'

The Guardian

John Oliver covered the dangers of AI on his weekly HBO show, calling it "worryingly corrosive" for society. On Last Week Tonight, Oliver said that the "spread of AI generation tools has made it very easy to flood social media sites with cheap, professional-looking, often deeply weird content", using the term AI slop to describe it all. He referred to it as the "newest iteration of spam", with weird images and videos flooding people's feeds and some people having "absolutely no idea that it isn't real". Oliver said that it was "extremely likely that we are gonna be drowning in this shit for the foreseeable future". With content such as this, "the whole point is to grab your attention", and given how easy it has become to make it, the barrier to entry has been lowered. Meta has not only joined the game with its own tool but has also tweaked the algorithm, meaning that more than a third of content in your feed is now from accounts you don't follow.


The Amazonification of Everything, Now as a Video Game

The Atlantic - Technology

Amazon delivery can be tough, unglamorous work. Workers must often reckon with complicated geography, demanding bosses, ever more biblical weather, and schedules that force time-conscious drivers to urinate in bottles. Surprising, then, that this is effectively the role in which one of the year's most anticipated video games casts the player. In Death Stranding 2, you arrange packages into swaying towers on your back, nudge the controller's left- and right-shoulder buttons to keep your weight balanced as you trip down rocky hills, and incur financial penalties for scuffing the merchandise if you take a tumble. The premise is a long trek from the super-soldier games, such as Call of Duty and Helldivers, that dominate the sales charts--even if you must occasionally battle the odd spectral marauder from a parallel dimension to clear the way to the next address on your delivery sheet.


'We were all pretty privileged': Allison Williams on Girls, nepo babies and toxic momfluencers

The Guardian

If you had wandered the set of the film M3gan 2.0 last year, chances are you would have stumbled into M3gan, the terrifying humanoid doll, staring lifelessly while she waited to be called for her next scene. Sometimes she would stand in the corner of the soundstage, says Allison Williams with a nervy laugh. "The dilemma is: do you turn her around so she's facing the wall, or do you let her face the room?" In the sequel to the sci-fi horror M3gan, Williams resumes her role as Gemma, a roboticist who has become a crusader against rampant and reckless AI development after her creation – developed for her orphaned niece – became murderous. Acting opposite M3gan was unsettling, says Williams, speaking over a video call from a hotel room in New York. Sometimes she was played by the 15-year-old dancer Amie Donald, but often she was a robotic doll, animated by a small team. "When she's been working for a while, her eyelids can get sticky," says Williams. M3gan's handlers would paint lubricant on to her eyeballs with a brush and Williams would have to catch herself: "She's not flinching and for a second you're like: 'Ugh.' Then you remember: this is not a live thing." Still best known for her first role as Marnie in Lena Dunham's landmark TV series Girls, Williams has gravitated towards comedy-tinged horror in recent years. Her first post-Girls film role was in the Oscar-winning dark comedy horror Get Out. It and M3gan were relatively low-budget projects that became cultural phenomena – Get Out for its commentary on racial politics, M3gan for what it says about the dangers of AI (as well as the uncanniness of M3gan herself). Williams has long been interested in AI – she knows Sam Altman, the co-founder and CEO of OpenAI, which created ChatGPT, who put her in touch with robotics experts when she was researching the role of Gemma.
The film raises questions not only about the danger of rogue AI, but about the ethical concerns – including how we should feel about the "rights" of devices. "It's easy to imbue anything that has AI in it with humanity.


Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization

arXiv.org Artificial Intelligence

Video synchronization, which aligns multiple video streams capturing the same event from different angles, is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings where such signals may be unreliable or absent. Additionally, existing benchmarks for video synchronization lack generality and reproducibility, restricting progress in the field. In this work, we introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods, such as human pose estimation, enabling broader applicability across different content types. We evaluate our system on newly composed datasets covering single-human, multi-human, and non-human scenarios, providing both the methodology and code for dataset creation to establish reproducible benchmarks. Our analysis reveals biases in prior SOTA work, particularly in SeSyn-Net's preprocessing pipeline, leading to inflated performance claims. We correct these biases and propose a more rigorous evaluation framework, demonstrating that VideoSync outperforms existing approaches, including SeSyn-Net, under fair experimental conditions. Additionally, we explore various synchronization offset prediction methods, identifying a convolutional neural network (CNN)-based model as the most effective. Our findings advance video synchronization beyond domain-specific constraints, making it more generalizable and robust for real-world applications.


TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) become increasingly integrated into real-world, autonomous applications, relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare to fixed references or depend solely on LLM-as-a-judge knowledge, TALE employs an agent with tool-access capabilities that actively retrieves and synthesizes external evidence. It iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By shifting away from static references, TALE aligns with free-form question-answering tasks common in real-world scenarios. Experimental results on multiple free-form QA benchmarks show that TALE not only outperforms standard reference-based metrics for measuring response accuracy but also achieves substantial to near-perfect agreement with human evaluations. TALE enhances the reliability of LLM evaluations in real-world, dynamic scenarios without relying on static references.
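
The iterative loop described in this abstract (generate a query, collect results, summarize, reflect, refine) can be sketched roughly as follows. This is not TALE's code or API: every function name, the verdict strings, and the stopping rule are hypothetical, and the search, summarization, and reflection steps are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable, List, Optional, Tuple


def evaluate_without_reference(
    question: str,
    answer: str,
    make_query: Callable[[str, str, List[str]], str],
    search: Callable[[str], List[str]],
    summarize: Callable[[List[str]], str],
    reflect: Callable[[str, str, List[str]], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Judge an answer by gathering evidence instead of using a fixed reference.

    reflect returns a verdict string once the notes are sufficient,
    or None to request another search round (an assumed convention).
    """
    notes: List[str] = []
    for _ in range(max_rounds):
        query = make_query(question, answer, notes)  # refine using earlier notes
        results = search(query)                      # collect external evidence
        notes.append(summarize(results))             # condense the findings
        verdict = reflect(question, answer, notes)   # decide or keep searching
        if verdict is not None:
            return verdict, notes
    return "undetermined", notes


# Stub components standing in for an LLM and a web search tool.
def stub_query(question, answer, notes):
    return f"{question} (round {len(notes) + 1})"


def stub_search(query):
    return [f"result for: {query}"]


def stub_summarize(results):
    return "; ".join(results)


def stub_reflect(question, answer, notes):
    return "correct" if len(notes) >= 2 else None


verdict, evidence = evaluate_without_reference(
    "What is the capital of France?", "Paris",
    stub_query, stub_search, stub_summarize, stub_reflect,
)
```

In a real system the stubs would be replaced by model calls and a search tool. The abstract does not say how TALE prompts its agent or aggregates evidence, so those details are left open here.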