Media
The PLLuM Instruction Corpus
Pęzik, Piotr, Żarnecki, Filip, Kaczyński, Konrad, Cichosz, Anna, Deckert, Zuzanna, Garnys, Monika, Grabarczyk, Izabela, Janowski, Wojciech, Karasińska, Sylwia, Kujawiak, Aleksandra, Misztela, Piotr, Szymańska, Maria, Walkusz, Karolina, Siek, Igor, Chrabąszcz, Maciej, Kołos, Anna, Karlińska, Agnieszka, Seweryn, Karolina, Krasnodębska, Aleksandra, Betscher, Paula, Cieślińska, Zofia, Kowol, Katarzyna, Wilczek, Artur, Trzciński, Maciej, Dziewulska, Katarzyna, Roszko, Roman, Bernaś, Tomasz, Vaičenonienė, Jurgita, Roszko, Danuta, Levchuk, Paweł, Kowalski, Paweł, Prawdzic-Jankowska, Irena, Kozłowski, Marek, Dadas, Sławomir, Poświata, Rafał, Wróblewska, Alina, Krasnowska-Kieraś, Katarzyna, Ogrodniczuk, Maciej, Rudolf, Michał, Rybak, Piotr, Saputa, Karolina, Wołoszyn, Joanna, Oleksy, Marcin, Koptyra, Bartłomiej, Ferdinan, Teddy, Woźniak, Stanisław, Piasecki, Maciej, Walkowiak, Paweł, Wojtasik, Konrad, Janz, Arkadiusz, Kazienko, Przemysław, Moska, Julia, Kocoń, Jan
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.
Device-Guided Music Transfer
Hung, Manh Pham, Hu, Changshuo, Dang, Ting, Ma, Dong
Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker's frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
Do Vision-Language Models Understand Visual Persuasiveness?
Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion-how visual cues shape human attitudes and decisions-remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias-models over-predict high persuasiveness-and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.
ARQUSUMM: Argument-aware Quantitative Summarization of Online Conversations
Tang, An Quang, Zhang, Xiuzhen, Dinh, Minh Ngoc, Li, Zhuang
Online conversations have become more prevalent on public discussion platforms (e.g. Reddit). With growing controversial topics, it is desirable to summarize not only diverse arguments, but also their rationale and justification. Early studies on text summarization focus on capturing general salient information in source documents, overlooking the argumentative nature of online conversations. Recent research on conversation summarization although considers the argumentative relationship among sentences, fail to explicate deeper argument structure within sentences for summarization. In this paper, we propose a novel task of argument-aware quantitative summarization to reveal the claim-reason structure of arguments in conversations, with quantities measuring argument strength. We further propose ARQUSUMM, a novel framework to address the task. To reveal the underlying argument structure within sentences, ARQUSUMM leverages LLM few-shot learning grounded in the argumentation theory to identify propositions within sentences and their claim-reason relationships. For quantitative summarization, ARQUSUMM employs argument structure-aware clustering algorithms to aggregate arguments and quantify their support. Experiments show that ARQUSUMM outperforms existing conversation and quantitative summarization models and generate summaries representing argument structures that are more helpful to users, of high textual quality and quantification accuracy.
Generative Augmented Reality: Paradigms, Technologies, and Future Applications
Liang, Chen, Zheng, Jiawen, Zeng, Yufeng, Tan, Yi, Lyu, Hengye, Zheng, Yuhui, Li, Zisu, Weng, Yueting, Shi, Jiaxin, Zhang, Hanwang
This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine's multi-stage modules with a unified generative backbone, where environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real-time generative augmentation feasible, and outline prospective applications that leverage its unified inference model. We envision GAR as a future AR paradigm that delivers high-fidelity experiences in terms of realism, interactivity, and immersion, while eliciting new research challenges on technologies, content ecosystems, and the ethical and societal implications.
Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Tseng, Wei-Cheng, Zhou, Xuanru, Huo, Mingyue, Shao, Yiwen, Zhang, Hao, Yu, Dong
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding. Early advances relied on supervised learning, where models trained on labeled corpora were adapted to related downstream tasks or transferred across domains (Kong et al., 2020; Chen et al., 2022a; Snyder et al., 2018; Desplanques et al., 2020).
AI use in American newspapers is widespread, uneven, and rarely disclosed
Russell, Jenna, Karpinska, Marzena, Akinode, Destiny, Thai, Katherine, Emi, Bradley, Spero, Max, Iyyer, Mohit
AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
ISS-Geo142: A Benchmark for Geolocating Astronaut Photography from the International Space Station
Srivastava, Vedika, Singh, Hemant Kumar, Singh, Jaisal
This paper introduces ISS-Geo142, a curated benchmark for geolocating astronaut photography captured from the International Space Station (ISS). Although the ISS position at capture time is known precisely, the specific Earth locations depicted in these images are typically not directly georeferenced, making automated localization non-trivial. ISS-Geo142 consists of 142 images with associated metadata and manually determined geographic locations, spanning a range of spatial scales and scene types. On top of this benchmark, we implement and evaluate three geolocation pipelines: a neural network based approach (NN-Geo) using VGG16 features and cross-correlation over map-derived Areas of Interest (AOIs), a Scale-Invariant Feature Transform based pipeline (SIFT-Match) using sliding-window feature matching on stitched high-resolution AOIs, and TerraByte, an AI system built around a GPT-4 model with vision capabilities that jointly reasons over image content and ISS coordinates. On ISS-Geo142, NN-Geo achieves a match for 75.52\% of the images under our evaluation protocol, SIFT-Match attains high precision on structurally rich scenes at substantial computational cost, and TerraByte establishes the strongest overall baseline, correctly geolocating approximately 90\% of the images while also producing human-readable geographic descriptions. The methods and experiments were originally developed in 2023; this manuscript is a revised and extended version that situates the work relative to subsequent advances in cross-view geo-localization and remote-sensing vision--language models. Taken together, ISS-Geo142 and these three pipelines provide a concrete, historically grounded benchmark for future work on ISS image geolocation.