Paruchuri, Akshay
What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
Paruchuri, Akshay, Garrison, Jake, Liao, Shun, Hernandez, John, Sunshine, Jacob, Althoff, Tim, Liu, Xin, McDuff, Daniel
Language models (LM) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We evaluate three ways to provide context to LMs 1) anchoring examples from within a distribution or family of distributions, 2) real-world context, 3) summary statistics on which to base a Normal approximation. Models can make inferences about distributions, and can be further aided by the incorporation of real-world context, example shots and simplified assumptions, even if these assumptions are incorrect or misspecified. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we will release publicly.
Transforming Wearable Data into Health Insights using Large Language Model Agents
Merrill, Mike A., Paruchuri, Akshay, Rezaei, Naghmeh, Kovacs, Geza, Perez, Javier, Liu, Yun, Schenck, Erik, Hammerquist, Nova, Sunshine, Jake, Tailor, Shyam, Ayush, Kumar, Su, Hao-Wei, He, Qian, McLean, Cory Y., Malhotra, Mark, Patel, Shwetak, Zhan, Jiening, Althoff, Tim, McDuff, Daniel, Liu, Xin
Personal health data, often derived from personal devices such as wearables, are distinguished by their multi-dimensional, continuous and longitudinal measurements that capture granular observations of physiology and behavior in-situ rather than in a clinical setting. Research studies have highlighted the significant health impacts of physical activity and sleep patterns, emphasizing the potential for wearable-derived data to reveal personalized health insights and promote positive behavior changes [1, 4, 30, 46, 47]. For example, individuals with a device-measured Physical Activity Energy Expenditure (PAEE) that is 5 kJ/kg/day higher had a 37% lower premature mortality risk [47]. Those with frequent sleep disturbances were associated with an increase in risk of hypertension, diabetes and cardiovascular diseases [9, 30]. A large meta-analysis suggests that activity trackers improve physical activity and promote weight loss, with users taking 1800 extra steps per day [16]. Despite these gross benefits, using wearable data to derive intelligent responses and insights to personal health queries is non-trivial. These data are usually collected without clinical supervision and users often do not have access to the expertise that could aid in data interpretation. For example, a common question of wearable device users is "How can I get better sleep?". Though a seemingly straightforward question, arriving at an ideal response would involve performing a series of complex, independent analytical steps across multiple irregularly sampled time series such as: checking the availability of recent data, deciding on metrics to optimize (e.g.