Incoherent Beliefs & Inconsistent Actions in Large Language Models

Pal, Arka, Kitanovski, Teo, Liang, Arthur, Potti, Akilesh, Goldblum, Micah

arXiv.org Artificial Intelligence

Real-world tasks and environments differ from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, which requires coherently updating beliefs in light of new evidence and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but hard to determine from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior and the correct update of their prior. Second, we find that LLMs often take actions that are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. They are also moderately self-inconsistent in how they respond when users challenge their answers. Finally, we show that these properties hold even for strong models that obtain high accuracy or are well-calibrated on the tasks at hand. Our results highlight the difficulty of predicting LLM behavior in complex real-world settings.
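The "correct update of the prior" that this abstract measures deviations against is simply Bayes' rule. A minimal sketch of that reference computation (function and variable names are illustrative, not from the paper):

```python
# Given a stated prior P(H) and likelihoods P(E|H), P(E|not H), Bayes' rule
# gives the posterior a coherent agent should report after seeing evidence E.

def bayes_posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H|E) via Bayes' rule for a binary hypothesis."""
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / evidence

# A model that states prior=0.30 and these likelihoods, but then directly
# reports a posterior far from this value, is updating incoherently.
posterior = bayes_posterior(prior=0.30, p_e_given_h=0.80, p_e_given_not_h=0.20)
print(round(posterior, 4))  # 0.24 / (0.24 + 0.14) ≈ 0.6316
```

The paper's measurement compares this mechanically computed posterior against the posterior the model states directly; the 30% figure is the average gap between the two.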


Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms

Xu, Shuoyan, Zhang, Yu, Miller, Eric J.

arXiv.org Artificial Intelligence

Abstract--Ride-hailing platforms, like other shared mobility services, are high-frequency, behavior-driven environments. Although survival analysis has been widely applied to recurrent events in other domains, its use for modeling ride-hailing driver behavior remains largely unexplored. To the best of our knowledge, this study is the first to formulate driver idle behavior as a recurrent survival process using large-scale platform data. We propose a survival analysis framework that uses a Transformer-based temporal encoder with causal masking to capture long-term temporal dependencies, together with driver-specific embeddings that represent latent individual characteristics. The framework models how historical idle sequences influence the current risk of leaving the platform via trip acceptance or log-off, substantially improving personalized prediction of driver retention risk. The model is validated on datasets from the City of Toronto over the period January 2 to March 13, 2020. The results show that the proposed Frailty-Aware Cox Transformer (FACT) delivers the highest time-dependent C-indices and the lowest Brier scores across early, median, and late follow-up, demonstrating its robustness in capturing evolving risk over a driver's lifecycle. This work enables operators to optimize retention strategies and helps policy makers assess shared mobility's role in equitable and integrated transportation systems. Shared mobility services, such as ride-hailing, car-sharing, and bike-sharing, are an increasingly prominent component of contemporary transportation systems, and are central to the broader concept of Mobility as a Service (MaaS) [1], which aims to integrate various forms of transport into a unified, user-centric platform.
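The Brier score reported here is the mean squared error between predicted event probabilities and the 0/1 outcomes (lower is better). A minimal sketch; the time-dependent version used in survival work additionally handles censoring via inverse-probability-of-censoring weights, which is omitted here:

```python
# Brier score: mean squared distance between predicted probabilities
# and observed binary outcomes. 0 is perfect; 0.25 matches a constant
# 0.5 prediction on balanced outcomes.

def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

print(round(brier_score([0.9, 0.2, 0.7], [1, 0, 1]), 4))  # (0.01 + 0.04 + 0.09) / 3
```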


Let the Experts Speak: Improving Survival Prediction & Calibration via Mixture-of-Experts Heads

Morrill, Todd, Puli, Aahlad, Megjhani, Murad, Park, Soojin, Zemel, Richard

arXiv.org Artificial Intelligence

Deep mixture-of-experts models have attracted considerable attention for survival analysis problems, particularly for their ability to cluster similar patients together. In practice, grouping often comes at the expense of key metrics such as calibration error and predictive accuracy. This is due to the restrictive inductive bias that mixture-of-experts imposes: predictions for individual patients must look like predictions for the group they are assigned to. Might we be able to discover patient group structure, where it exists, while improving calibration and predictive accuracy? In this work, we introduce several discrete-time deep mixture-of-experts (MoE) architectures for survival analysis, one of which achieves all desiderata: clustering, calibration, and predictive accuracy. We show that a key differentiator among these MoEs is how expressive their experts are: experts that tailor predictions per patient outperform experts that rely on fixed group prototypes.
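The basic shape of a discrete-time MoE survival head can be sketched in a few lines: each expert emits per-interval hazards and a gate mixes them into one hazard curve per patient. This is a generic illustration of the architecture family, not the paper's specific model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_hazards(gate_logits, expert_hazards):
    """gate_logits: (K,); expert_hazards: (K, T) per-interval hazards.
    Returns the gate-weighted (T,) hazard curve for one patient."""
    w = softmax(gate_logits)      # soft responsibility of each expert
    return w @ expert_hazards

def survival_curve(hazards):
    """Discrete-time survival: S(t) = prod_{s <= t} (1 - h_s)."""
    return np.cumprod(1.0 - hazards)

# One patient, two experts (e.g. "low-risk" and "high-risk" prototypes),
# three time intervals. The gate leans toward expert 0.
h = moe_hazards(np.array([2.0, 0.0]),
                np.array([[0.1, 0.2, 0.3],
                          [0.4, 0.4, 0.4]]))
print(survival_curve(h).round(3))
```

With fixed rows in `expert_hazards`, every patient's curve is a convex blend of the same prototypes; the paper's point is that making experts themselves functions of the patient lifts exactly this restriction.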


Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks

Rabaey, Paloma, Tench, Adrick, Heytens, Stefan, Demeester, Thomas

arXiv.org Artificial Intelligence

Electronic health records (EHRs) form an invaluable resource for training clinical decision support systems. To leverage the potential of such systems in high-risk applications, we need large, structured tabular datasets on which we can build transparent feature-based models. While part of the EHR already contains structured information (e.g. diagnosis codes, medications, and lab results), much of the information is contained within unstructured text (e.g. discharge summaries and nursing notes). In this work, we propose a method for multi-modal patient-level information extraction that leverages both the tabular features available in the patient's EHR (using an expert-informed Bayesian network) as well as clinical notes describing the patient's symptoms (using neural text classifiers). We propose the use of virtual evidence augmented with a consistency node to provide an interpretable, probabilistic fusion of the models' predictions. The consistency node improves the calibration of the final predictions compared to virtual evidence alone, allowing the Bayesian network to better adjust the neural classifier's output to handle missing information and resolve contradictions between the tabular and text data. We show the potential of our method on the SimSUM dataset, a simulated benchmark linking tabular EHRs with clinical notes through expert knowledge.
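The virtual-evidence mechanism this abstract builds on can be illustrated for a single binary variable: the neural classifier's probability enters the network as a likelihood ratio rather than a hard observation, so the tabular prior is adjusted instead of overwritten. A simplified sketch (names are illustrative; the paper's consistency node is omitted):

```python
# q = P(symptom | note) from the text classifier is applied as soft
# (virtual) evidence on top of the Bayesian network's prior p.

def fuse_virtual_evidence(p_prior: float, q_classifier: float) -> float:
    """Posterior odds = prior odds * classifier likelihood ratio q/(1-q)."""
    odds = (p_prior * q_classifier) / ((1 - p_prior) * (1 - q_classifier))
    return odds / (1 + odds)

# The BN's tabular evidence says the symptom is unlikely (p = 0.1); the
# text model is fairly confident it is present (q = 0.8). Fusion lands
# in between, rather than trusting either source outright.
print(round(fuse_virtual_evidence(0.1, 0.8), 4))  # ≈ 0.3077
```

An uninformative classifier output (q = 0.5) leaves the prior unchanged, which is the behavior that distinguishes soft evidence from clamping the variable to a value.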





da18c47118a2d09926346f33bebde9f4-Paper-Conference.pdf

Neural Information Processing Systems

If diversity does in fact explain uncertainty-quantification (UQ) and robustness improvements, this would suggest that deep ensembles offer benefits that cannot be obtained by (standard) single neural networks. In this paper, we rigorously test hypotheses that formalize this intuition. Surprisingly, after controlling for factors related to the performance of an ensemble's component models, we find no evidence that having a diverse set of predictions is responsible for these purported benefits.


Revisiting the Calibration of Modern Neural Networks

Neural Information Processing Systems

These concerns are more relevant than ever, since the architecture size, amount of training data, and computing power used by state-of-the-art models continue to increase.
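The calibration these papers examine is typically summarized by expected calibration error (ECE): predictions are binned by confidence, and each bin's mean confidence is compared with its empirical accuracy. A minimal sketch with equal-width bins (binning schemes vary across papers):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: confidence-weighted gap between stated confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(h for _, h in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Two high-confidence predictions (one wrong) plus two mid-confidence ones.
print(round(expected_calibration_error([0.95, 0.9, 0.6, 0.55], [1, 0, 1, 0]), 4))
```

A perfectly calibrated model, whose confidence in every bin matches its accuracy there, scores zero; the cited concern is that larger, better-performing networks do not automatically score low.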