Response to Reviewer 2: Empirical evaluation: Interestingly, we actually did an empirical evaluation in the earlier
We thank the reviewers for the positive feedback and their interest in our work! Below we address some questions. Both algorithms are well tuned for hyperparameters. We did not include it in the submission because after all the
We will make sure to define these terms earlier in the paper in the revision, and we are happy to clarify them.
Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks
Kevin Wang, Subre Abdoul Moktar, Jia Li, Kangshuo Li, Feng Chen
Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is therefore paramount, and Uncertainty Estimation (UE) plays a key role. In this work, we conduct a comprehensive empirical study of the robustness and effectiveness of diverse UE measures for aleatoric and epistemic uncertainty in LLMs. The study covers twelve UE methods and four generation-quality metrics, including LLMScore from LLM criticizers, to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model's understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model's epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, perform reliably across datasets and generation metrics; they generally do well but may not be optimal in every situation.
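As a concrete illustration of the two families of measures mentioned above, the sketch below computes a length-normalized negative log-likelihood (an information-based measure) and a simple consistency-based uncertainty from repeated samples. The token probabilities and sampled answers are hypothetical placeholders; a real system would obtain them from the LLM.

```python
import math
from collections import Counter

def length_normalized_nll(token_probs):
    """Information-based uncertainty: average negative log-probability
    of the generated tokens (lower probability -> higher uncertainty)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def consistency_uncertainty(sampled_answers):
    """Consistency-based uncertainty: 1 minus the frequency of the most
    common answer among samples (more disagreement -> higher uncertainty)."""
    counts = Counter(sampled_answers)
    return 1.0 - counts.most_common(1)[0][1] / len(sampled_answers)

# Hypothetical per-token probabilities for two candidate answers.
confident = [0.9, 0.8, 0.95]
uncertain = [0.5, 0.4, 0.6]
assert length_normalized_nll(confident) < length_normalized_nll(uncertain)

# Hypothetical repeated samples for one question.
samples = ["Paris", "Paris", "Paris", "Lyon"]
print(round(consistency_uncertainty(samples), 2))  # 0.25
```

Note that the information-based score needs white-box access to token probabilities, whereas the consistency score only needs repeated sampling, which partly explains their different ID/OOD behavior.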
It is known that adding additive Gaussian noise to the features is equivalent to an l_2 regularizer in a least-squares problem (Bishop). This paper studies multiplicative Bernoulli feature noising in a shallow learning architecture with a general loss function and shows that it has the effect of adapting the geometry through an l_2 regularizer that rescales the features (beta^{\top} D(beta, X) beta). The matrix D(beta, X) is an estimate of the inverse diagonal Fisher information. It is worth noting that D does not depend on the labels. The equivalent regularizer of dropout is non-convex in general.
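The squared-loss special case of this equivalence can be checked numerically: averaging the loss over all Bernoulli dropout masks exactly reproduces the clean loss plus an l_2-style penalty whose weights are rescaled by the features. This is a minimal sketch of that identity with arbitrary illustrative data and keep probability, not the paper's general-loss analysis.

```python
import itertools

# Single sample (x, y), coefficients beta, keep probability p.
x = [1.0, 2.0, -1.5]
y = 3.0
beta = [0.5, -0.25, 1.0]
p = 0.8

# Exact expected squared loss under inverted dropout: each feature is
# kept with probability p and rescaled by 1/p, so the noised feature
# has mean x_j and the prediction is unbiased.
expected = 0.0
for mask in itertools.product([0, 1], repeat=len(x)):
    prob = 1.0
    pred = 0.0
    for m, xj, bj in zip(mask, x, beta):
        prob *= p if m else (1 - p)
        pred += (m / p) * xj * bj
    expected += prob * (y - pred) ** 2

# Closed form: clean loss + ((1 - p) / p) * sum_j x_j^2 beta_j^2,
# i.e. an l_2 penalty whose geometry is rescaled by the features.
clean = (y - sum(xj * bj for xj, bj in zip(x, beta))) ** 2
penalty = (1 - p) / p * sum((xj * bj) ** 2 for xj, bj in zip(x, beta))
assert abs(expected - (clean + penalty)) < 1e-9
```

The penalty term depends only on the inputs and coefficients, never on y, mirroring the review's observation that D does not depend on the labels.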
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This paper proposes an approach to stochastic multi-objective optimization. The main idea is simply stated: optimize a single objective while treating the other objectives as constraints. The authors propose a primal-dual stochastic optimization algorithm to solve the problem and prove that it achieves (for the primal objective) the optimal 1/\sqrt{T} convergence rate. As far as I am concerned, the theory is solid and provides good insight into the problem of interest.
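The constrained reformulation the review describes can be illustrated on a toy deterministic problem: minimize f0(x) = x^2 subject to f1(x) = (x - 2)^2 <= 1, whose solution is x = 1. The updates below (gradient descent on the Lagrangian in the primal variable, projected gradient ascent in the multiplier) are a generic primal-dual sketch, not the paper's algorithm; the step size and iteration count are arbitrary choices.

```python
# Toy constrained problem: min x^2  s.t.  (x - 2)^2 <= 1  (optimum x = 1).
def grad_f0(x):
    return 2 * x

def f1(x):
    return (x - 2) ** 2 - 1  # constraint written in the form f1(x) <= 0

def grad_f1(x):
    return 2 * (x - 2)

x, lam, eta = 3.0, 0.0, 0.01
avg, T = 0.0, 20000
for t in range(T):
    # Primal step: descend the Lagrangian x^2 + lam * f1(x).
    x -= eta * (grad_f0(x) + lam * grad_f1(x))
    # Dual step: projected ascent keeps the multiplier nonnegative.
    lam = max(0.0, lam + eta * f1(x))
    avg += x
avg /= T
print(round(avg, 2))  # averaged iterate, close to the optimum x = 1
```

Averaging the iterates is what the convex-concave analysis controls; it is the averaged point, not the last iterate, that enjoys the 1/\sqrt{T}-type guarantee.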
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper proposes and analyzes a method for learning in robust MDPs. While this setting is very similar to learning in stochastic games, the main difference is that in stochastic games the optimal move of the opponent is observed, whereas in robust MDPs the decision maker only observes the outcome (the opponent chooses the transition probabilities). The paper makes a small advance on a relevant, non-trivial, and interesting topic, but I am not sure that it is quite ready for publication in its current form. First, the setting is somewhat contrived and unmotivated. A more natural setting would be simply to use reinforcement learning to learn to act in a robust setting.
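For context, the robust MDP objective the review refers to replaces the usual Bellman backup with a worst case over an uncertainty set U of transition models: V(s) = max_a min_{P in U} sum_s' P(s'|s,a) [r(s,a) + gamma V(s')]. The sketch below runs robust value iteration on a tiny two-state example with a finite uncertainty set; all numbers are illustrative inventions, not from the paper.

```python
# Robust value iteration on a 2-state, 2-action MDP.
# Uncertainty set: for each (s, a), a finite set of candidate transition
# distributions; the "opponent" picks the worst one for the agent.
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]
REWARD = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 2.0}
UNCERTAIN_P = {  # two candidate next-state distributions per (s, a)
    (0, 0): [[0.9, 0.1], [0.6, 0.4]],
    (0, 1): [[0.5, 0.5], [0.2, 0.8]],
    (1, 0): [[0.8, 0.2], [1.0, 0.0]],
    (1, 1): [[0.1, 0.9], [0.4, 0.6]],
}

V = [0.0, 0.0]
for _ in range(200):  # the robust backup is still a gamma-contraction
    V = [
        max(  # decision maker picks the best action ...
            min(  # ... against the worst transition model in the set
                REWARD[s, a] + GAMMA * sum(p * V[s2] for s2, p in enumerate(dist))
                for dist in UNCERTAIN_P[s, a]
            )
            for a in ACTIONS
        )
        for s in STATES
    ]
print([round(v, 3) for v in V])
```

The review's point is precisely that the learner never sees which distribution the inner min selected, only a sampled next state, which is what makes learning (as opposed to planning with a known U, as above) difficult.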
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper proposes a supervised learning algorithm that uses stochastic gradient descent and periodically expands the hypothesis space by introducing new basis functions and adding corresponding components to the weight vector. As it processes more data, it therefore fits more complex models. The hypothesis space considered here consists of polynomials, and higher-order monomials are gradually introduced into the model. The concept of growing the hypothesis space as more data arrives is not new (training kernel methods with SGD exhibits this behavior), but in the proposed method, choosing which monomials to add to the hypothesis space is very cheap.
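The growth scheme described in the review can be mimicked in a few lines: stream data through SGD on a polynomial model, and periodically append the next-degree monomial with a zero-initialized weight. This is a generic sketch of the idea with a fixed growth schedule, not the paper's (cheap, data-driven) rule for selecting which monomials to add.

```python
import random

random.seed(0)

def target(x):
    return 1.0 + 2.0 * x + 0.5 * x ** 2  # ground-truth polynomial

weights = [0.0]           # start with degree 0 (a constant basis function)
eta, max_degree = 0.01, 3

losses = []
for step in range(1, 6001):
    x = random.uniform(-1, 1)
    y = target(x)
    feats = [x ** d for d in range(len(weights))]
    pred = sum(w * f for w, f in zip(weights, feats))
    err = pred - y
    losses.append(err * err)
    # SGD step on the squared loss.
    weights = [w - eta * 2 * err * f for w, f in zip(weights, feats)]
    # Periodically grow the hypothesis space with the next monomial;
    # the new component starts at zero, so the current fit is unchanged.
    if step % 1500 == 0 and len(weights) <= max_degree:
        weights.append(0.0)

early = sum(losses[:500]) / 500
late = sum(losses[-500:]) / 500
assert late < early  # richer model + more data reduces the streaming loss
```

Zero-initializing the new component is the key to seamless growth: expanding the weight vector never increases the current loss, so the model complexity can ratchet up as data arrives.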
To Reviewer
It seems you misunderstood some key points and details; we hope our explanation below helps clarify the misunderstandings and confusion. By "specific learning rate schedule", we think
We think the empirical evidence is sufficient to verify our theoretical claims, and this is exactly the case here. Figure 1(b) in [Triantafillou et al. 2020] shows that the increase of shots
For your other comments: 1) The inner-task gap vanishes because the expectation of the loss function w.r.t.