eval
Explanations that reveal all through the definition of encoding
Feature attributions attempt to highlight what inputs drive predictive power. Good attributions or explanations are thus those that produce inputs that retain this predictive power; accordingly, evaluations of explanations score their quality of prediction. However, evaluations produce scores better than what appears possible from the values in the explanation for a class of explanations, called encoding explanations. Probing for encoding remains a challenge because there is no general characterization of what gives the extra predictive power. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. This definition implies, in contrast to encoding explanations, that non-encoding explanations contain all the informative inputs used to produce the explanation, giving them a "what you see is what you get" property, which makes them transparent and simple to use.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- (2 more...)
Arbitrarily Scalable Environment Generators via Neural Cellular Automata
We study the problem of generating arbitrarily large environments to improve the throughput of multi-robot systems. Prior work proposes Quality Diversity (QD) algorithms as an effective method for optimizing the environments of automated warehouses. However, these approaches optimize only relatively small environments, falling short when it comes to replicating real-world warehouse sizes. The challenge arises from the exponential increase in the search space as the environment size increases.
- North America > United States > California (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Illinois (0.04)
- North America > Canada (0.04)
Appendix
Overconfidence in deep neural networks could easily lead to deployments where predictions are made that should have been withheld. Figure 7: ResNet-50 trained onCIFAR-10 using focal lossγ = 0,3,4,5. Similarly, the confidence of the top predicted classˆy (for the training sample) isdenoted byˆptrain,top and theaverage equivalent inabinbyCtrain,top. Forthe training set, we care only about the confidence ofthe "true class"ˆptrain,true asthat isthe quantity which gets manipulated by some loss function. For validation set, on the other hand, we care about the confidence of the "top predicted class".
Measuring all the noises of LLM Evals
Separating signal from noise is central to experimental science. Applying well-established statistical method effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.