Goto

Collaborating Authors

 Media


The Newspaper That Hired ChatGPT

The Atlantic - Technology

For more than 20 years, print media has been a bit of a punching bag for digital-technology companies. Craigslist killed the paid classifieds, free websites led people to think newspapers and magazines were committing robbery when they charged for subscriptions, and the smartphone and social media turned reading full-length articles into a chore. Now generative AI is in the mix--and many publishers, desperate to avoid being left behind once more, are rushing to harness the technology themselves. Several major publications, including The Atlantic, have entered into corporate partnerships with OpenAI and other AI firms. Any number of experiments have ensued--publishers have used the software to help translate work into different languages, draft headlines, and write summaries or even articles.


A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations

arXiv.org Artificial Intelligence

Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities; We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.


GottBERT: a pure German Language Model

arXiv.org Artificial Intelligence

Lately, pre-trained language models advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoders for Transformers (BERT) and its optimized version RoBERTa have had significant impact and increased the relevance of pre-trained models. First, research in this field mainly started on English data followed by models trained with multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. Currently, no German single language RoBERTa model is yet published, which we introduce in this work (GottBERT). The German portion of the OSCAR data set was used as text corpus. In an evaluation we compare its performance on the two Named Entity Recognition (NER) tasks Conll 2003 and GermEval 2014 as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD with existing German single language BERT models and two multilingual ones. GottBERT was pre-trained related to the original RoBERTa model using fairseq. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT. The experiments were setup utilizing FARM. Performance was measured by the $F_{1}$ score. GottBERT was successfully pre-trained on a 256 core TPU pod using the RoBERTa BASE architecture. Even without extensive hyper-parameter optimization, in all NER and one text classification task, GottBERT already outperformed all other tested German and multilingual models. In order to support the German NLP field, we publish GottBERT under the AGPLv3 license.


Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems. Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs: Encoder, Aligner, and Predictor. MLLM-as-Encoder focuses on enhancing graph neural networks (GNNs) via multimodal feature fusion; MLLM-as-Aligner aligns multimodal attributes in language or hidden space to enable LLM-based graph reasoning; MLLM-as-Predictor treats MLLMs as standalone reasoners with in-context learning or fine-tuning. Despite their advances, the MMG field lacks a unified benchmark to fairly evaluate across these approaches, making it unclear what progress has been made. To bridge this gap, we present Graph-MLLM, a comprehensive benchmark for multimodal graph learning by systematically evaluating these three paradigms across six datasets with different domains. Through extensive experiments, we observe that jointly considering the visual and textual attributes of the nodes benefits graph learning, even when using pre-trained text-to-image alignment models (e.g., CLIP) as encoders. We also find that converting visual attributes into textual descriptions further improves performance compared to directly using visual inputs. Moreover, we observe that fine-tuning MLLMs on specific MMGs can achieve state-of-the-art results in most scenarios, even without explicit graph structure information. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field.


Classifying Unreliable Narrators with Large Language Models

arXiv.org Artificial Intelligence

Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.


Fine-Grained control over Music Generation with Activation Steering

arXiv.org Artificial Intelligence

--We present a method for fine-grained control over music generation through inference-time interventions on an autoregressive generative music transformer called MusicGen. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using weights of linear probes trained on it, or by steering the attention layer activations in a similar manner . We observe that modelling this as a regression task provides improved performance, hypothesizing that the mean-squared-error better preserve meaningful directional information in the activation space. Combined with the global conditioning offered by text prompts in MusicGen, our method provides both global and local control over music generation. Audio samples illustrating our method are available at our demo page.


Cybernetic Marionette: Channeling Collective Agency Through a Wearable Robot in a Live Dancer-Robot Duet

arXiv.org Artificial Intelligence

We describe DANCE^2, an interactive dance performance in which audience members channel their collective agency into a dancer-robot duet by voting on the behavior of a wearable robot affixed to the dancer's body. At key moments during the performance, the audience is invited to either continue the choreography or override it, shaping the unfolding interaction through real-time collective input. While post-performance surveys revealed that participants felt their choices meaningfully influenced the performance, voting data across four public performances exhibited strikingly consistent patterns. This tension between what audience members do, what they feel, and what actually changes highlights a complex interplay between agentive behavior, the experience of agency, and power. We reflect on how choreography, interaction design, and the structure of the performance mediate this relationship, offering a live analogy for algorithmically curated digital systems where agency is felt, but not exercised.


VINCIE: Unlocking In-context Image Editing from Video

arXiv.org Artificial Intelligence

In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.


The Role of Generative AI in Facilitating Social Interactions: A Scoping Review

arXiv.org Artificial Intelligence

Reduced social connectedness increasingly poses a threat to mental health, life expectancy, and general well-being. Generative AI (GAI) technologies, such as large language models (LLMs) and image generation tools, are increasingly integrated into applications aimed at enhancing human social experiences. Despite their growing presence, little is known about how these technologies influence social interactions. This scoping review investigates how GAI-based applications are currently designed to facilitate social interaction, what forms of social engagement they target, and which design and evaluation methodologies designers use to create and evaluate them. Through an analysis of 30 studies published since 2020, we identify key trends in application domains including storytelling, socio-emotional skills training, reminiscence, collaborative learning, music making, and general conversation. We highlight the role of participatory and co-design approaches in fostering both effective technology use and social engagement, while also examining socio-ethical concerns such as cultural bias and accessibility. This review underscores the potential of GAI to support dynamic and personalized interactions, but calls for greater attention to equitable design practices and inclusive evaluation strategies.


Contrastive Matrix Completion with Denoising and Augmented Graph Views for Robust Recommendation

arXiv.org Artificial Intelligence

Matrix completion is a widely adopted framework in recommender systems, as predicting the missing entries in the user-item rating matrix enables a comprehensive understanding of user preferences. However, current graph neural network (GNN)-based approaches are highly sensitive to noisy or irrelevant edges--due to their inherent message-passing mechanisms--and are prone to overfitting, which limits their generalizability. To overcome these challenges, we propose a novel method called Matrix Completion using Contrastive Learning (MCCL). Our approach begins by extracting local neighborhood subgraphs for each interaction and subsequently generates two distinct graph representations. The first representation emphasizes denoising by integrating GNN layers with an attention mechanism, while the second is obtained via a graph variational autoencoder that aligns the feature distribution with a standard prior. A mutual learning loss function is employed during training to gradually harmonize these representations, enabling the model to capture common patterns and significantly enhance its generalizability. Extensive experiments on several real-world datasets demonstrate that our approach not only improves the numerical accuracy of the predicted scores--achieving up to a 0.8% improvement in RMSE--but also produces superior rankings with improvements of up to 36% in ranking metrics.