Goto

Collaborating Authors

 Media


Fraud-Proof Revenue Division on Subscription Platforms

arXiv.org Artificial Intelligence

We study a model of subscription-based platforms where users pay a fixed fee for unlimited access to content, and creators receive a share of the revenue. Existing approaches to detecting fraud predominantly rely on machine learning methods, engaging in an ongoing arms race with bad actors. We explore revenue division mechanisms that inherently disincentivize manipulation. We formalize three types of manipulation-resistance axioms and examine which existing rules satisfy these. We show that a mechanism widely used by streaming platforms, not only fails to prevent fraud, but also makes detecting manipulation computationally intractable. We also introduce a novel rule, ScaledUserProp, that satisfies all three manipulation-resistance axioms. Finally, experiments with both real-world and synthetic streaming data support ScaledUserProp as a fairer alternative compared to existing rules.


Back to Ear: Perceptually Driven High Fidelity Music Reconstruction

arXiv.org Artificial Intelligence

ABSTRACT V ariational Autoencoders (V AEs) are essential for large-scale audio tasks like diffusion-based generation. To address these challenges, we propose ฯตar-V AE, an open-source music signal reconstruction model that rethinks and optimizes the V AE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. Experiments show ฯตar-V AE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and the spatial characteristics. Index T erms-- V AE, Music, Phase, Perceptual Weighting 1. INTRODUCTION Achieving perfect, perceptually lossless reconstruction of complex audio signals like music remains a central challenge in audio engineering and machine learning.


Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.


Two Causally Related Needles in a Video Haystack

arXiv.org Artificial Intelligence

Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex question type, causal two-needle questions, require extracting information from both the cause and effect events from a long video and the associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip containing the answer, and verbal description of a visual detail from that video clip. Our experiments reveal that models excelling on existing benchmarks struggle with causal 2-needle questions, and the model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs. The dataset is available at: https://huggingface.co/datasets/causal2needles/Causal2Needles


Rater Equivalence: Evaluating Classifiers in Human Judgment Settings

arXiv.org Artificial Intelligence

In many decision settings, the definitive ground truth is either non-existent or inaccessible. We introduce a framework for evaluating classifiers based solely on human judgments. In such cases, it is helpful to compare automated classifiers to human judgment. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.


MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation

arXiv.org Artificial Intelligence

We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.


The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013--2023

arXiv.org Artificial Intelligence

Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard's Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.


GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation

arXiv.org Artificial Intelligence

Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7$\%$ higher intrinsic accuracy, 8.6$\%$ lower hallucination rates, and 6.9$\%$ greater correctness compared to greedy decoding, while attaining the highest truth--informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.


Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

arXiv.org Artificial Intelligence

Social and collaborative platforms emit multivariate time-series traces in which early interactions--such as views, likes, or downloads--are followed, sometimes months or years later, by higher impact like citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets--arXiv (accesses citations of 2.3M papers) and GitHub (pushes/stars forks of 3M repositories)--and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page-views edits), Spotify (streams concert attendance), e-commerce (click-throughs purchases), and LinkedIn profile (views messages). Our datasets provide ideal testbeds for lead-lag forecasting, by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding sur-vivorship bias in sampling. We documented all technical details of data cura-tion and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. The success of human activities is often measured by their collective impact, ranging from music streams and movie box office revenues to product sales and social media popularity. These impact metrics typically follow heavy-tailed distributions (Clauset et al., 2009) and slow decay patterns across timescales (Candia et al., 2019), making early identification of future hits fundamentally challenging (Cheng et al., 2014; Martin et al., 2016). At the same time, digital platforms increasingly log online user interactions--searches, views, downloads, likes, and shares--that often precede these long-term dynamics. These temporal lead-lag dynamics are remarkably ubiquitous, spanning domains as diverse as science (Haque & Ginsparg, 2009), economics (Wu & Brynjolfsson, 2015), arts (Goel et al., 2010), culture (Gruhl et al., 2005), and social movements (Johnson et al., 2016). A systematic understanding of such lead-lag dynamics is not only crucial for anticipating and optimizing impact in digital ecosystems, but also essential for designing effective strategies that identify and promote emerging innovations and products.