baseball
Latent Chain-of-Thought for Visual Reasoning
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
A Taxonomy of Non-Strategic Microeconomics
We begin by characterizing the space of elements that test an agent's ability to optimally allocate their limited resources to goods and services they desire. In economics and decision theory, the most primitive approach to describing the preferences of decision-makers is to use a function that maps a set of possible choices to the agent's optimal choice within that set. Under a set of intuitive assumptions, such as transitivity (i.e., if bundle X is preferred to bundle Y, and Y is preferred to bundle Z, then X must be preferred to Z), it becomes possible to "rationalize" preferences by instead describing a utility function. This function assigns a real number to each bundle, and the agent selects the bundle with the highest utility. In this paper, we focus on these "rationalizable" preferences, where agent choice can be implemented as utility maximization constrained by prices and income. The solution to these consumer choice problems provides ...
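As an illustration of utility maximization constrained by prices and income, the sketch below uses a Cobb-Douglas utility function over two goods. The functional form, the function names, and the numbers are assumptions chosen for illustration and are not taken from the excerpt. The closed-form demand is the standard textbook result for this utility function, and a coarse search along the budget line is included as a sanity check.

```python
def cobb_douglas_utility(x, y, alpha):
    """Utility of a bundle (x, y) under u(x, y) = x^alpha * y^(1 - alpha)."""
    return (x ** alpha) * (y ** (1 - alpha))


def optimal_bundle(alpha, price_x, price_y, income):
    """Closed-form demand for Cobb-Douglas utility.

    The consumer spends the share alpha of income on good x
    and the share (1 - alpha) on good y.
    """
    x = alpha * income / price_x
    y = (1 - alpha) * income / price_y
    return x, y


def best_on_budget_line(alpha, price_x, price_y, income, steps=1000):
    """Search bundles that exhaust the budget and return the best one found."""
    best_bundle = None
    best_utility = float("-inf")
    max_x = income / price_x
    for i in range(steps + 1):
        x = max_x * i / steps
        # Guard against tiny negative values caused by floating-point rounding.
        y = max(0.0, (income - price_x * x) / price_y)
        utility = cobb_douglas_utility(x, y, alpha)
        if utility > best_utility:
            best_bundle, best_utility = (x, y), utility
    return best_bundle


# Hypothetical example: equal preference weights, prices 2 and 1, income 100.
analytic = optimal_bundle(alpha=0.5, price_x=2.0, price_y=1.0, income=100.0)
searched = best_on_budget_line(alpha=0.5, price_x=2.0, price_y=1.0, income=100.0)
```

In this toy setting the search along the budget line should land on (or very near) the closed-form bundle, which is one way to see that the selected bundle is the one with the highest utility among the affordable ones.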
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
Large language models (LLMs) are increasingly being asked to make economically rational decisions and indeed are already being applied to economic tasks like stock picking and financial analysis. Existing LLM benchmarks tend to focus on specific applications, making them insufficient for characterizing economic reasoning more broadly. In previous work, we offered a blueprint for comprehensively benchmarking strategic decision-making (Raman et al. [2024]). However, this work did not engage with the even larger microeconomic literature on non-strategic settings. We address this gap here, taxonomizing microeconomic reasoning into 58 distinct elements, each grounded in up to 10 distinct domains, 5 perspectives, and 3 types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. By generating fresh questions for each element, auto-STEER induces diversity which could help to reduce the risk of data contamination. We use this benchmark to evaluate 27 LLMs spanning a range of scales and adaptation strategies, comparing performance across multiple formats--multiple-choice and free-text question answering--and scoring schemes. Our results surface systematic limitations in current LLMs' ability to generalize economic reasoning across types, formats, and textual perturbations, and establish a foundation for evaluating and improving economic competence in foundation models.
Assessing win strength in MLB win prediction models
In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so, we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision-making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting leads to significant losses.
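To make the betting step more concrete, the sketch below shows the generic arithmetic of turning an estimated probability into an expected return per unit staked under American odds. It is not the paper's strategy: the function names and numbers are hypothetical, and it assumes that the probability passed in is the chance that the specific bet wins (for a run-line bet, the chance of covering the line, which is not the same as the chance of winning the game).

```python
def american_odds_to_profit(odds):
    """Profit per unit staked if the bet wins, given American odds.

    Positive odds (e.g. +150) pay odds/100 per unit staked;
    negative odds (e.g. -200) pay 100/|odds| per unit staked.
    """
    if odds > 0:
        return odds / 100.0
    return 100.0 / abs(odds)


def expected_value(probability, odds):
    """Expected profit per unit staked for a bet that wins with `probability`."""
    profit = american_odds_to_profit(odds)
    return probability * profit - (1.0 - probability)


def break_even_probability(odds):
    """Win probability at which the expected value of the bet is zero."""
    profit = american_odds_to_profit(odds)
    return 1.0 / (1.0 + profit)


# Hypothetical example: a model estimates a 45% chance that a +150 bet wins.
ev = expected_value(0.45, 150)
threshold = break_even_probability(150)
```

A simple rule built on this would place a bet only when the estimated probability exceeds the break-even threshold by some margin. Betting on every game the model favors, regardless of the odds, ignores that threshold.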
Estrada signs with the Dodgers
The star pitcher has been studying aerospace engineering at MIT. Now his pitches will take flight in professional baseball. Like almost any MIT student, Mason Estrada wants to take what he learned on campus and apply it to the working world. Unlike any other current MIT student, Estrada's primary workplace is a pitcher's mound. Estrada, the star pitcher for MIT's baseball team, has signed a contract with the Los Angeles Dodgers, who selected him in the seventh round of the Major League Baseball draft on July 14. The right-hander, whose fastball has reached 96 miles per hour, is taking a leave of absence from the Institute and has reported to the Dodgers' instructional camp in Arizona.
The study of short texts in digital politics: Document aggregation for topic modeling
Nakka, Nitheesha, Yalcin, Omer F., Desmarais, Bruce A., Rajtmajer, Sarah, Monroe, Burt
Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.
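The aggregation step described above, in which short documents are pooled into larger ones along a natural unit such as the posting account, can be sketched as follows. The record layout, the field names `account` and `text`, and the toy tweets are hypothetical and do not come from the study's data or code. The aggregated documents would then be passed to whatever topic model is being used.

```python
from collections import defaultdict


def aggregate_by_unit(records, unit_key="account", text_key="text"):
    """Concatenate short texts that share the same unit into one document.

    `records` is an iterable of dicts; `unit_key` names the field that
    partitions the corpus (for example, the account that posted a tweet).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record[unit_key]].append(record[text_key])
    # Join each unit's texts in their original order into a single document.
    return {unit: " ".join(texts) for unit, texts in grouped.items()}


# Hypothetical toy corpus of three tweets from two accounts.
tweets = [
    {"account": "legislator_a", "text": "Voting on the state budget today."},
    {"account": "legislator_b", "text": "Town hall on road funding tonight."},
    {"account": "legislator_a", "text": "The budget passed the house."},
]

account_documents = aggregate_by_unit(tweets)
```

Changing `unit_key` to another partitioning field (for instance a state or a month, if such fields exist in the data) changes the document definition without touching the rest of the pipeline.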
This could be baseball's last season without 'robot umpires'
If there's one thing baseball fans are averse to, it's change. Over the MLB's 149-year history, alterations to the game's rules, like lowering the pitcher's mound (1968) or introducing instant replay challenges (2014), came only after years of heated debate between reformers and purists. Maybe the most contentious issue ever to divide these two camps is whether or not to replace notoriously inaccurate human home plate umpires with less fallible machines. Though that was once largely considered out of the bounds of possibility, MLB games officiated by so-called "robot umpires" are now closer to reality than ever before. Starting this week, batters stepping up to the plate during spring training games will have the ability to challenge an umpire's pitch calls and have them immediately reviewed by a computer.
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
Lu, Pan, Chen, Bowen, Liu, Sheng, Thapa, Rahul, Boen, Joseph, Zou, James
Solving complex reasoning tasks may involve visual understanding, domain knowledge retrieval, numerical calculation, and multi-step reasoning. Existing methods augment large language models (LLMs) with external tools but are restricted to specialized domains, limited tool types, or require additional training data. In this paper, we introduce OctoTools, a training-free, user-friendly, and easily extensible open-source agentic framework designed to tackle complex reasoning across diverse domains. OctoTools introduces standardized tool cards to encapsulate tool functionality, a planner for both high-level and low-level planning, and an executor to carry out tool usage. We validate OctoTools' generality across 16 diverse tasks (including MathVista, MMLU-Pro, MedQA, and GAIA-Text), achieving substantial average accuracy gains of 9.3% over GPT-4o. Furthermore, OctoTools outperforms AutoGen, GPT-Functions and LangChain by up to 10.6% when given the same set of tools. Through comprehensive analysis and ablations, OctoTools demonstrates advantages in task planning, effective tool usage, and multi-step problem solving.
Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling
Mu, Yida, Bai, Peizhen, Bontcheva, Kalina, Song, Xingyi
Large language models (LLMs) with their strong zero-shot topic extraction capabilities offer an alternative to probabilistic topic modelling and closed-set topic classification approaches. As zero-shot topic extractors, LLMs are expected to understand human instructions to generate relevant and non-hallucinated topics based on the given documents. However, LLM-based topic modelling approaches often face difficulties in generating topics with adherence to granularity as specified in human instructions, often resulting in many near-duplicate topics. Furthermore, methods for addressing hallucinated topics generated by LLMs have not yet been investigated. In this paper, we focus on addressing the issues of topic granularity and hallucinations for better LLM-based topic modelling. To this end, we introduce a novel approach that leverages Direct Preference Optimisation (DPO) to fine-tune open-source LLMs, such as Mistral-7B. Our approach does not rely on traditional human annotation to rank preferred answers but employs a reconstruction pipeline to modify raw topics generated by LLMs, thus enabling a fast and efficient training and inference framework. Comparative experiments show that our fine-tuning approach not only significantly improves the LLM's capability to produce more coherent, relevant, and precise topics, but also reduces the number of hallucinated topics.
Pose-free object classification from surface contact features in sequences of Robotic grasps
Alves, Teresa, Bernardino, Alexandre, Moreno, Plinio
In this work, we propose two cost-efficient methods for object identification, using a multi-fingered robotic hand equipped with proprioceptive sensing. Both methods are trained on known objects and rely on a limited set of features, obtained during a few grasps on an object. Contrary to most methods in the literature, our methods do not rely on the knowledge of the relative pose between object and hand, which greatly expands the domain of application. However, if that knowledge is available, we propose an additional active exploration step that reduces the overall number of grasps required for a good recognition of the object. One of the methods depends on the contact positions and normals and the other depends on the contact positions alone. We test the proposed methods in the GraspIt! simulator and show that haptic-based object classification is possible in pose-free conditions. We evaluate the parameters that produce the most accurate results and require the fewest grasps for classification.