 basketball




Inferring Event Descriptions from Time Series with Language Models

Tan, Mingtian, Merrill, Mike A., Gottesman, Zack, Althoff, Tim, Evans, David, Hartvigsen, Tom

arXiv.org Artificial Intelligence

Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. When analyzing time series, we often seek to understand the underlying events occurring in the measured environment. For example, one might ask: What caused a sharp drop in the stock price? Events are often described with natural language, so we conduct the first study of whether Large Language Models (LLMs) can infer natural language events from time series. We curate a new benchmark of win probabilities collected from 4,200 basketball and American football games, comprising 1.7M timesteps of real-valued data with corresponding natural language events. Building on the recent wave of using LLMs on time series, we evaluate 16 LLMs and find that they demonstrate promising abilities to infer events from time series data. The open-weights DeepSeek-R1 32B model outperforms proprietary models like GPT-4o. Despite this impressive initial performance, we also find clear avenues to improve recent models, as we identify failures when altering the provided context, event sequence lengths, and evaluation strategy. (All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime)


The study of short texts in digital politics: Document aggregation for topic modeling

Nakka, Nitheesha, Yalcin, Omer F., Desmarais, Bruce A., Rajtmajer, Sarah, Monroe, Burt

arXiv.org Artificial Intelligence

Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.


Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Wang, Yiping, He, Xuehai, Wang, Kuan, Ma, Luyao, Yang, Jianwei, Wang, Shuohang, Du, Simon Shaolei, Shen, Yelong

arXiv.org Artificial Intelligence

The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models shows how challenging the benchmark is, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
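
The unanimous voting step mentioned in the abstract above can be illustrated with a small sketch. This is not the StoryEval implementation: the function names, the nested-list input format, and the choice to score a story as the fraction of its events judged complete are all assumptions made for illustration.

```python
from typing import Sequence


def event_completed(votes: Sequence[bool]) -> bool:
    """Unanimous voting: an event counts only if every verifier marks it complete."""
    return len(votes) > 0 and all(votes)


def story_completion_rate(stories: Sequence[Sequence[Sequence[bool]]]) -> float:
    """Average, over stories, of the fraction of events judged complete.

    `stories[s][e]` holds the verifier votes for event `e` of story `s`.
    This scoring rule is an assumption, not taken from the paper.
    """
    if not stories:
        return 0.0
    per_story = []
    for events in stories:
        if not events:
            raise ValueError("each story needs at least one event")
        completed = sum(event_completed(votes) for votes in events)
        per_story.append(completed / len(events))
    return sum(per_story) / len(per_story)


# Hypothetical votes from two verifiers on two generated videos.
example_stories = [
    [[True, True], [True, False]],                  # 1 of 2 events unanimous
    [[True, True], [True, True], [False, False]],   # 2 of 3 events unanimous
]
example_rate = story_completion_rate(example_stories)
```

Requiring unanimity trades recall for precision: a single dissenting verifier is enough to mark an event as not completed.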


Adversarial Circuit Evaluation

de Bos, Niels uit, Garriga-Alonso, Adrià

arXiv.org Artificial Intelligence

Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
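
As a rough illustration of the metric described above, the sketch below computes the KL divergence between a full model's output distribution and a circuit's output distribution, then ranks inputs by that divergence. It assumes both distributions are already available as lists of probabilities; the resample ablation that produces the circuit's output is not implemented here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is clamped at `eps` to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def worst_inputs(
    full_outputs: Sequence[Sequence[float]],
    circuit_outputs: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[float, int]]:
    """Return the k (divergence, input index) pairs with the largest KL divergence."""
    scores = [
        (kl_divergence(p, q), i)
        for i, (p, q) in enumerate(zip(full_outputs, circuit_outputs))
    ]
    scores.sort(reverse=True)
    return scores[:k]


# Toy distributions over two output tokens (made-up numbers).
full = [[0.5, 0.5], [1.0, 0.0]]
circuit = [[0.5, 0.5], [0.5, 0.5]]
worst = worst_inputs(full, circuit, k=1)
```

Sorting by divergence and inspecting the top entries corresponds to the worst-case view the abstract describes, as opposed to reporting only an average over inputs.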


Mathematical models for off-ball scoring prediction in basketball

Kono, Rikako, Fujii, Keisuke

arXiv.org Artificial Intelligence

In professional basketball, the accurate prediction of scoring opportunities based on strategic decision-making is crucial for space and player evaluations. However, traditional models often face challenges in accounting for the complexities of off-ball movements, which are essential for accurate predictive performance. In this study, we propose two mathematical models to predict off-ball scoring opportunities in basketball, considering both pass-to-score and dribble-to-score movements: the Ball Movement for Off-ball Scoring (BMOS) and the Ball Intercept and Movement for Off-ball Scoring (BIMOS) models. The BMOS adapts principles from the Off-Ball Scoring Opportunities (OBSO) model, originally designed for soccer, to basketball, whereas the BIMOS also incorporates the likelihood of interception during ball movements. We evaluated these models using player tracking data from 630 NBA games in the 2015-2016 regular season, demonstrating that the BIMOS outperforms the BMOS in terms of scoring prediction accuracy. Thus, our models provide valuable insights for tactical analysis and player evaluation in basketball.


Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation

Varshney, Neeraj, Raj, Satyam, Mishra, Venkatesh, Chatterjee, Agneet, Sarkar, Ritika, Saeidi, Amir, Baral, Chitta

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, they have been shown to suffer from a critical limitation: 'hallucination' in their output. Recent research has focused on investigating and addressing this problem for a variety of tasks such as biography generation, question answering, abstractive summarization, and dialogue generation. However, the crucial aspect pertaining to 'negation' has remained considerably underexplored. Negation is important because it adds depth and nuance to the understanding of language and is also crucial for logical reasoning and inference. In this work, we address the above limitation and particularly focus on studying the impact of negation in LLM hallucinations. Specifically, we study four tasks with negation: 'false premise completion', 'constrained fact generation', 'multiple choice question answering', and 'fact generation'. We show that open-source state-of-the-art LLMs such as LLaMA-2-chat, Vicuna, and Orca-2 hallucinate considerably on all these tasks involving negation, which underlines a critical shortcoming of these models. Addressing this problem, we further study numerous strategies to mitigate these hallucinations and demonstrate their impact.


Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Chughtai, Bilal, Cooney, Alan, Nanda, Neel

arXiv.org Artificial Intelligence

How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form 'Fact: The Colosseum is in the country of'. We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several distinct, independent, and qualitatively different mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute through summing up multiple independent contributions. Each mechanism's contribution may be insufficient alone, but summing results in constructive interference on the correct answer. In addition, we extend the method of direct logit attribution to attribute an attention head's output to individual source tokens. We use this technique to unpack what we call 'mixed heads' -- which are themselves a pair of two separate additive updates from different source tokens.
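
The additive motif described above can be shown with a toy example. The mechanism names and logit values below are invented for illustration and do not come from the paper; the point is only that individual contributions can each favor a wrong answer or be weak, while their sum favors the correct one.

```python
from typing import Dict


def sum_contributions(contributions: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Add per-mechanism logit contributions token by token."""
    total: Dict[str, float] = {}
    for logits in contributions.values():
        for token, value in logits.items():
            total[token] = total.get(token, 0.0) + value
    return total


def top_token(logits: Dict[str, float]) -> str:
    """Return the token with the largest logit."""
    return max(logits, key=logits.get)


# Hypothetical contributions for 'The Colosseum is in the country of'.
contributions = {
    "mechanism_a": {"Italy": 1.0, "France": 1.2, "Greece": 0.1},
    "mechanism_b": {"Italy": 1.0, "France": 0.0, "Greece": 1.3},
    "mechanism_c": {"Italy": 0.8, "France": 0.3, "Greece": 0.2},
}

combined = sum_contributions(contributions)
prediction = top_token(combined)
```

In this made-up setting, `mechanism_a` alone would pick "France" and `mechanism_b` alone would pick "Greece", yet the summed logits put "Italy" first because every mechanism gives it a moderate boost.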


Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

Yousefi, Safoora, Betthauser, Leo, Hasanbeig, Hosein, Millière, Raphaël, Momennejad, Ida

arXiv.org Artificial Intelligence

Large language models (LLMs) exhibit remarkable performance improvement through in-context learning (ICL) by leveraging task-specific examples in the input. However, the mechanisms behind this improvement remain elusive. In this work, we investigate how LLM embeddings and attention representations change following in-context learning, and how these changes mediate improvement in behavior. We employ neuroscience-inspired techniques such as representational similarity analysis (RSA) and propose novel methods for parameterized probing and measuring the ratio of attention to relevant vs. irrelevant information in Llama-2 70B and Vicuna 13B. We designed two tasks with a priori relationships among their conditions: linear regression and reading comprehension. We formed hypotheses about expected similarities in task representations and measured hypothesis alignment of LLM representations before and after ICL as well as changes in attention. Our analyses revealed a meaningful correlation between improvements in behavior after ICL and changes in both embeddings and attention weights across LLM layers. This empirical framework empowers a nuanced understanding of how latent representations shape LLM behavior, offering valuable tools and insights for future research and practical applications.
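
A minimal sketch of representational similarity analysis, in the spirit of the hypothesis-alignment measure mentioned above: compute pairwise similarities between embeddings, then correlate them with a hypothesized similarity matrix. Cosine similarity and Pearson correlation are assumptions here, since the abstract does not state which measures the authors use, and the embeddings are toy values rather than LLM activations.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Upper triangle (i < j) of the embedding similarity matrix, row by row."""
    n = len(embeddings)
    return [cosine(embeddings[i], embeddings[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def hypothesis_alignment(
    embeddings: Sequence[Sequence[float]], hypothesis: Sequence[Sequence[float]]
) -> float:
    """Correlate observed pairwise similarities with a hypothesized similarity matrix."""
    n = len(embeddings)
    expected = [hypothesis[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(pairwise_similarities(embeddings), expected)


# Toy embeddings: items 0 and 1 share one condition, items 2 and 3 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_hypothesis = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
alignment = hypothesis_alignment(toy_embeddings, toy_hypothesis)
```

Comparing this alignment score for representations taken before and after in-context examples are added would give one way to quantify the kind of change the abstract describes.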