Brier score
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Incoherent Beliefs & Inconsistent Actions in Large Language Models
Pal, Arka, Kitanovski, Teo, Liang, Arthur, Potti, Akilesh, Goldblum, Micah
Real-world tasks and environments differ from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence, and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but is difficult to do from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior and the correct update of their prior. Second, we find that LLMs also often take actions that are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.
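As a rough illustration of the coherence check described in this abstract, the sketch below compares a directly elicited posterior with the Bayes update of an elicited prior. The binary-hypothesis setup, the function names `bayes_posterior` and `coherence_gap`, and the example numbers are illustrative assumptions, not the paper's protocol.

```python
def bayes_posterior(prior: float, lik_if_true: float, lik_if_false: float) -> float:
    """Return P(H | E) for a binary hypothesis H using Bayes' rule.

    prior        : P(H) before seeing the evidence E
    lik_if_true  : P(E | H)
    lik_if_false : P(E | not H)
    """
    numerator = prior * lik_if_true
    evidence = numerator + (1.0 - prior) * lik_if_false
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


def coherence_gap(elicited_posterior: float, prior: float,
                  lik_if_true: float, lik_if_false: float) -> float:
    """Absolute gap between a stated posterior and the Bayes update of the prior."""
    correct = bayes_posterior(prior, lik_if_true, lik_if_false)
    return abs(elicited_posterior - correct)


if __name__ == "__main__":
    # Hypothetical numbers: the stated prior is 0.5, the evidence is four times
    # likelier under H, yet the directly elicited posterior is only 0.6.
    print(coherence_gap(0.6, prior=0.5, lik_if_true=0.8, lik_if_false=0.2))
```

A gap of zero would mean the stated posterior matches the update implied by the stated prior and likelihoods; larger gaps indicate less coherent updating under these assumptions.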
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms
Xu, Shuoyan, Zhang, Yu, Miller, Eric J.
Shared mobility platforms such as ride-hailing are high-frequency, behavior-driven environments. Although survival analysis has been widely applied to recurrent events in other domains, its use for modeling ride-hailing driver behavior remains largely unexplored. To the best of our knowledge, this study is the first to formulate driver idle behavior as a recurrent survival process using large-scale platform data. This study proposes a survival analysis framework that uses a Transformer-based temporal encoder with causal masking to capture long-term temporal dependencies and driver-specific embeddings to represent latent individual characteristics. Together these significantly enhance the personalized prediction of driver retention risk by modeling how historical idle sequences influence the current risk of leaving the platform via trip acceptance or log-off. The model is validated on datasets from the City of Toronto over the period January 2 to March 13, 2020. The results show that the proposed Frailty-Aware Cox Transformer (FACT) delivers the highest time-dependent C-indices and the lowest Brier Scores across early, median, and late follow-up, demonstrating its robustness in capturing evolving risk over a driver's lifecycle. This study enables operators to optimize retention strategies and helps policy makers assess shared mobility's role in equitable and integrated transportation systems. The purpose of this study is to model driver retention behavior through a Transformer-based survival model. Shared mobility services, such as ride-hailing, car-sharing, and bike-sharing, are becoming an increasingly prominent component of contemporary transportation systems. These services are central to the broader concept of Mobility as a Service (MaaS) [1], which aims to integrate various forms of transport into a unified and user-centric platform.
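For readers unfamiliar with the Brier score in a survival setting, here is a minimal sketch of the score at a single time horizon. It assumes every event time is fully observed; evaluation on censored data normally adds inverse-probability-of-censoring weights, which this sketch omits. The function name `brier_score_at` and the toy inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def brier_score_at(t: float,
                   event_times: Sequence[float],
                   predicted_survival: Sequence[float]) -> float:
    """Brier score at horizon t, assuming no censoring.

    event_times[i]        : observed time at which subject i had the event
    predicted_survival[i] : model's predicted probability that subject i
                            is still event-free at time t, i.e. S_i(t)
    """
    if len(event_times) != len(predicted_survival):
        raise ValueError("inputs must have the same length")
    if not event_times:
        raise ValueError("need at least one subject")
    total = 0.0
    for time, surv_prob in zip(event_times, predicted_survival):
        still_event_free = 1.0 if time > t else 0.0  # observed outcome at t
        total += (still_event_free - surv_prob) ** 2
    return total / len(event_times)


if __name__ == "__main__":
    times = [2.0, 5.0, 8.0]
    # An uninformative model that always predicts 0.5 scores 0.25.
    print(brier_score_at(4.0, times, [0.5, 0.5, 0.5]))
```

Lower is better: a score of 0 means every predicted survival probability matched the observed status at the horizon, and evaluating at several horizons gives the early, median, and late comparison the abstract mentions.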
- North America > Canada > Ontario > Toronto (0.06)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > India (0.04)
- Asia > China (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
Let the Experts Speak: Improving Survival Prediction & Calibration via Mixture-of-Experts Heads
Morrill, Todd, Puli, Aahlad, Megjhani, Murad, Park, Soojin, Zemel, Richard
Deep mixture-of-experts models have attracted a lot of attention for survival analysis problems, particularly for their ability to cluster similar patients together. In practice, grouping often comes at the expense of key metrics such as calibration error and predictive accuracy. This is due to the restrictive inductive bias that mixture-of-experts imposes, that predictions for individual patients must look like predictions for the group they're assigned to. Might we be able to discover patient group structure, where it exists, while improving calibration and predictive accuracy? In this work, we introduce several discrete-time deep mixture-of-experts (MoE)-based architectures for survival analysis problems, one of which achieves all desiderata: clustering, calibration, and predictive accuracy. We show that a key differentiator between this array of MoEs is how expressive their experts are. We find that more expressive experts that tailor predictions per patient outperform experts that rely on fixed group prototypes.
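To make the discrete-time setting concrete, the sketch below turns per-interval hazards into a survival curve and then mixes per-expert curves with gating weights. This is a generic construction under assumed names (`survival_from_hazards`, `mix_survival`); it is not a reproduction of any of the architectures the paper introduces.

```python
from typing import List, Sequence


def survival_from_hazards(hazards: Sequence[float]) -> List[float]:
    """S(t_k) = prod_{j <= k} (1 - h_j) for discrete-time hazards h_j."""
    curve, surviving = [], 1.0
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve


def mix_survival(gate_weights: Sequence[float],
                 expert_hazards: Sequence[Sequence[float]]) -> List[float]:
    """Weighted average of the experts' survival curves.

    gate_weights   : one non-negative weight per expert, summing to 1
    expert_hazards : one hazard sequence per expert, all the same length
    """
    if len(gate_weights) != len(expert_hazards):
        raise ValueError("need one gate weight per expert")
    if abs(sum(gate_weights) - 1.0) > 1e-9:
        raise ValueError("gate weights must sum to 1")
    curves = [survival_from_hazards(h) for h in expert_hazards]
    n_bins = len(curves[0])
    if any(len(c) != n_bins for c in curves):
        raise ValueError("all experts must use the same time bins")
    return [sum(w * c[k] for w, c in zip(gate_weights, curves))
            for k in range(n_bins)]


if __name__ == "__main__":
    print(survival_from_hazards([0.5, 0.5]))                      # [0.5, 0.25]
    print(mix_survival([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]]))     # [0.5, 0.5]
```

In this toy form each expert has one fixed hazard sequence, which corresponds to the "fixed group prototype" end of the spectrum the abstract describes; a more expressive expert would compute its hazards from each patient's features.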
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
- Health & Medicine > Health Care Technology (0.67)
- Health & Medicine > Therapeutic Area > Nephrology (0.46)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)
- Health & Medicine > Diagnostic Medicine > Imaging (0.46)
Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks
Rabaey, Paloma, Tench, Adrick, Heytens, Stefan, Demeester, Thomas
Electronic health records (EHRs) form an invaluable resource for training clinical decision support systems. To leverage the potential of such systems in high-risk applications, we need large, structured tabular datasets on which we can build transparent feature-based models. While part of the EHR already contains structured information (e.g. diagnosis codes, medications, and lab results), much of the information is contained within unstructured text (e.g. discharge summaries and nursing notes). In this work, we propose a method for multi-modal patient-level information extraction that leverages both the tabular features available in the patient's EHR (using an expert-informed Bayesian network) as well as clinical notes describing the patient's symptoms (using neural text classifiers). We propose the use of virtual evidence augmented with a consistency node to provide an interpretable, probabilistic fusion of the models' predictions. The consistency node improves the calibration of the final predictions compared to virtual evidence alone, allowing the Bayesian network to better adjust the neural classifier's output to handle missing information and resolve contradictions between the tabular and text data. We show the potential of our method on the SimSUM dataset, a simulated benchmark linking tabular EHRs with clinical notes through expert knowledge.
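The idea of virtual (soft) evidence can be sketched for a single discrete variable: the network's current belief is multiplied elementwise by a likelihood vector and renormalized. The helper name `fuse_virtual_evidence` and the numbers are illustrative assumptions; how the paper derives likelihoods from its text classifiers, and its consistency node, are not reproduced here.

```python
from typing import List, Sequence


def fuse_virtual_evidence(belief: Sequence[float],
                          likelihood: Sequence[float]) -> List[float]:
    """Update a belief over the states of one variable with virtual evidence.

    belief     : current P(X = x) for each state x (for example, from tabular data)
    likelihood : relative support for each state x from the soft observation;
                 only the ratios between entries matter
    """
    if len(belief) != len(likelihood):
        raise ValueError("belief and likelihood must cover the same states")
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("virtual evidence rules out every state")
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Hypothetical case: the text signal supports state 0 three times more than state 1.
    print(fuse_virtual_evidence([0.5, 0.5], [3.0, 1.0]))  # [0.75, 0.25]
    # A flat likelihood (no usable text signal) leaves the belief unchanged.
    print(fuse_virtual_evidence([0.2, 0.8], [1.0, 1.0]))  # [0.2, 0.8]
```

The flat-likelihood case shows one simple way missing textual information could be represented, namely as evidence that favors no state.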
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Asia > Middle East > Jordan (0.04)
- Education (0.46)
- Information Technology (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
- Asia > Middle East > Israel (0.05)
- Asia > Middle East > Iran (0.04)
- North America > United States > California (0.04)
- Research Report > Experimental Study (1.00)
- Overview (0.92)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Government > Voting & Elections (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- North America > United States > California (0.04)
- North America > Canada (0.04)
da18c47118a2d09926346f33bebde9f4-Paper-Conference.pdf
If diversity does in fact explain UQ/robustness improvements, this would suggest that deep ensembles indeed offer benefits that cannot be obtained by (standard) single neural networks. In this paper, we rigorously test hypotheses that formalize this intuition. Surprisingly, after controlling for factors related to the performance of an ensemble's component models, we find no evidence that having a diverse set of predictions is responsible for these purported benefits.
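One way to see how ensemble diversity and component performance are entangled is the standard ambiguity decomposition for squared-error losses such as the Brier score: the loss of the averaged prediction equals the average member loss minus the variance of the member predictions. The sketch below checks this for a single binary example. It illustrates a general identity under assumed names (`ambiguity_decomposition`) and is not the analysis performed in the paper.

```python
from typing import Sequence, Tuple


def brier(prob: float, outcome: float) -> float:
    """Squared error between a predicted probability and a 0/1 outcome."""
    return (prob - outcome) ** 2


def ambiguity_decomposition(member_probs: Sequence[float],
                            outcome: float) -> Tuple[float, float, float]:
    """Return (ensemble_brier, mean_member_brier, diversity) for one example.

    The ensemble prediction is the plain average of member probabilities, and
    diversity is the variance of member predictions around that average, so
    ensemble_brier == mean_member_brier - diversity up to rounding.
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    m = len(member_probs)
    ensemble_prob = sum(member_probs) / m
    mean_member_brier = sum(brier(p, outcome) for p in member_probs) / m
    diversity = sum((p - ensemble_prob) ** 2 for p in member_probs) / m
    return brier(ensemble_prob, outcome), mean_member_brier, diversity


if __name__ == "__main__":
    ens, avg, div = ambiguity_decomposition([0.2, 0.6], outcome=1.0)
    print(ens, avg, div)
    print(abs(ens - (avg - div)) < 1e-12)  # the identity holds up to rounding
```

Because the identity ties the ensemble score to both terms at once, comparing ensembles by diversity alone says little unless the quality of the component models is held fixed, which is the kind of control the abstract refers to.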
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)