confidence score
Flowing with Confidence
de Kruiff, Friso, Coscia, Dario, Welling, Max, Bekkers, Erik
Generative models can produce nonsensical text, unrealistic images, and unstable materials faster than simulation or human review can absorb; without per-sample confidence, trust erodes. Existing fixes run $k$ ensembles or stochastic trajectories at $k\times$ compute, measuring variability between models, not model confidence. We propose Flow Matching with Confidence (FMwC). FMwC injects input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory, yielding a per-sample confidence score at standard sampling cost. The score supports multiple uses: filtering improves image quality and thermodynamic stability of crystals; editing rewinds trajectories to the points where the model commits and redirects them; and adaptive stepping concentrates ODE compute where the flow is ambiguous. We find that the confidence score correlates with the magnitude of the divergence of the learned velocity field, which gives us a window to understand the generative process, opening up surgical forms of guidance that target the moments that matter, new sampling algorithms and interpretability of generative models.
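The variance propagation and trajectory integration described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single linear layer as the velocity field, independent Gaussian multiplicative noise on each input dimension, plain Euler integration, and a hypothetical noise-scale function `sigma_fn`. For a linear layer the output variance has an exact closed form, which is the only part the sketch relies on.

```python
import numpy as np

def noisy_linear_moments(W, x, sigma):
    """Mean and per-output variance of y = W @ (x * (1 + eps)),
    with eps_j ~ N(0, sigma_j^2) independent across input dimensions."""
    mean = W @ x
    # Var(y_i) = sum_j W_ij^2 * x_j^2 * sigma_j^2 (exact for a linear layer)
    var = (W ** 2) @ ((x * sigma) ** 2)
    return mean, var

def sample_with_confidence(x0, W, sigma_fn, n_steps=50):
    """Euler-integrate dx/dt = W x from t=0 to t=1 and accumulate the
    summed velocity variance along the path (assumed uncertainty proxy)."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    accumulated_var = 0.0
    for _ in range(n_steps):
        mean, var = noisy_linear_moments(W, x, sigma_fn(x))
        accumulated_var += dt * var.sum()
        x = x + dt * mean
    return x, accumulated_var

# Hypothetical usage: larger inputs get more noise in this toy sigma_fn.
W = np.array([[0.0, -1.0], [1.0, 0.0]])
x_end, uncertainty = sample_with_confidence(
    np.array([1.0, 0.5]), W, lambda x: 0.1 * np.abs(x)
)
```

In this sketch a smaller accumulated variance would be read as higher confidence; how the paper maps variance to its reported score is not stated in the abstract.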
Continuous Heatmap Regression for Pose Estimation via Implicit Neural Representation
Heatmap regression has dominated human pose estimation due to its superior performance and strong generalization. To meet the requirements of traditional explicit neural networks for output form, existing heatmap-based methods discretize the originally continuous heatmap representation into 2D pixel arrays, which leads to performance degradation due to the introduction of quantization errors. This problem is significantly exacerbated as the size of the input image decreases, which makes heatmap-based methods not much better than coordinate regression on low-resolution images. In this paper, we propose a novel neural representation for human pose estimation called NerPE to achieve continuous heatmap regression. Given any position within the image range, NerPE regresses the corresponding confidence scores for body joints according to the surrounding image features, which guarantees continuity in space and confidence during training. Thanks to the decoupling from spatial resolution, NerPE can output the predicted heatmaps at arbitrary resolution during inference without retraining, which easily achieves sub-pixel localization precision. To reduce the computational cost, we design progressive coordinate decoding to cooperate with continuous heatmap regression, in which localization no longer requires the complete generation of high-resolution heatmaps.
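The quantization error mentioned above can be made concrete with a small sketch. It does not implement NerPE; it only shows how decoding a joint from a discretized heatmap by argmax snaps the estimate to the grid, and how the error grows as the grid gets coarser. The Gaussian heatmap shape, the `stride` parameter, and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_heatmap(joint_xy, width, height, stride, sigma=2.0):
    """Gaussian centred on the joint, sampled at the cell centres of a grid
    that is `stride` times coarser than the image (all in image pixels)."""
    xs = (np.arange(width // stride) + 0.5) * stride
    ys = (np.arange(height // stride) + 0.5) * stride
    gx, gy = np.meshgrid(xs, ys)  # both have shape (len(ys), len(xs))
    d2 = (gx - joint_xy[0]) ** 2 + (gy - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2)), xs, ys

def decode_argmax(heatmap, xs, ys):
    """Standard discrete decoding: take the coordinates of the highest cell."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([xs[col], ys[row]])

true_joint = np.array([10.3, 7.6])
for stride in (1, 2, 4, 8):
    hm, xs, ys = gaussian_heatmap(true_joint, 64, 48, stride)
    error = np.linalg.norm(decode_argmax(hm, xs, ys) - true_joint)
    print(f"stride={stride}: localization error = {error:.3f} px")
```

A continuous representation sidesteps this by letting the confidence be queried at any position, so localization precision is not tied to the grid spacing.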
Supplementary
1 PrinCut
1.1 How to use PrinCut
The PrinCut GUI is shown in Figure 1. PrinCut is a MATLAB app, and its package is also provided in the supplementary. The left shows raw data without annotation. The right shows both raw data and annotation overlay.
NIS3D: A Completely Annotated Benchmark for Dense 3D Nuclei Image Segmentation
The existing nuclei segmentation benchmarks either worked on 2D only or annotated a small number of 3D cells, perhaps due to the high cost of 3D annotation for large-scale data. To fulfill the critical need, we constructed NIS3D, a 3D, high cell density, large-volume, and completely annotated Nuclei Image Segmentation benchmark, assisted by our newly designed semi-automatic annotation software. NIS3D provides more than 22,000 cells across multiple most-used species in this area. Each cell is labeled by three independent annotators, so we can measure the variability of each annotation. A confidence score is computed for each cell, allowing more nuanced testing and performance comparison. A comprehensive review on the methods of segmenting 3D dense nuclei was conducted. The benchmark was used to evaluate the performance of several selected state-of-the-art segmentation algorithms. The best of current methods is still far away from human-level accuracy, corroborating the necessity of generating such a benchmark. The testing results also demonstrated the strength and weakness of each method and pointed out the directions of further methodological development.
[Figure: NORCAL overview. Labels: Input Image; CNN + RPN; Classification, Regression, Mask; Long-Tailed Object Detection; Post-Processing Calibration. Example classes: Bulldozer, School bus, Truck.]
Vanilla models for object detection and instance segmentation suffer from the heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this issue mostly during training, e.g., by re-sampling or reweighting. In this paper, we investigate a largely overlooked approach -- postprocessing calibration of confidence scores. We propose NORCAL, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweighs the predicted scores of each class by its training sample size. We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NORCAL can effectively improve nearly all the baseline models not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into various modeling choices and mechanisms of our approach. Our code is publicly available at https://github.com/tydpan/NorCal.
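The recipe in this abstract (reweigh each class score by its training sample size, handle background separately, normalize per proposal) can be sketched as below. This is a minimal illustration and not the code from the linked repository: the exponent `gamma`, the convention that the last column is background, and the function name are assumptions.

```python
import numpy as np

def norcal_calibrate(scores, class_counts, gamma=1.0):
    """Post-hoc calibration sketch.

    scores:       (num_proposals, C + 1) class probabilities per proposal;
                  the last column is assumed to be the background class.
    class_counts: (C,) training sample sizes of the foreground classes.
    gamma:        assumed strength of the reweighting (0 leaves scores as-is).
    """
    scores = np.asarray(scores, dtype=float)
    counts = np.asarray(class_counts, dtype=float)
    fg, bg = scores[:, :-1], scores[:, -1:]
    # Down-weight frequent classes; background is left unscaled.
    scaled_fg = fg / counts ** gamma
    # Normalize over all classes separately for each proposal.
    denom = scaled_fg.sum(axis=1, keepdims=True) + bg
    return np.concatenate([scaled_fg / denom, bg / denom], axis=1)

# Hypothetical proposal: a frequent class (1000 samples) vs. a rare one (10).
calibrated = norcal_calibrate([[0.6, 0.3, 0.1]], class_counts=[1000, 10])
```

With `gamma=0` the function returns the input scores unchanged whenever each row already sums to one, which makes it easy to check that the calibration is the only thing being varied.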
AutoPSV: Automated Process-Supervised Verifier
This verification model assigns a confidence score to each reasoning step, indicating the probability of arriving at the correct final answer from that point onward. We detect relative changes in the verification model's confidence scores across reasoning steps to automatically annotate the reasoning process, enabling error detection even in scenarios where ground truth answers are unavailable. This alleviates the need for numerous manual annotations or the high computational costs associated with model-induced annotation approaches. We experimentally validate that the step-level confidence changes learned by the verification model trained on the final answer correctness can effectively identify errors in the reasoning steps. We demonstrate that the verification model, when trained on process annotations generated by AutoPSV, exhibits improved performance in selecting correct answers from multiple LLM-generated outputs. Notably, we achieve substantial improvements across five datasets in mathematics and commonsense reasoning.
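The annotation step described above, flagging a reasoning step when the verifier's confidence drops sharply relative to the previous step, might look like the following sketch. The relative-change formula, the threshold value, and the function name are assumptions for illustration; the abstract does not specify them.

```python
def annotate_steps(confidences, threshold=-0.5):
    """Return one boolean per step: True = no sharp drop, False = flagged.

    confidences: verifier scores in (0, 1], one per reasoning step.
    threshold:   hypothetical cutoff on the relative change between steps.
    """
    if not confidences:
        return []
    labels = [True]  # the first step has no predecessor to compare against
    for prev, curr in zip(confidences, confidences[1:]):
        # Relative change in confidence from the previous step.
        rel_change = (curr - prev) / prev if prev > 0 else 0.0
        labels.append(rel_change > threshold)
    return labels

# Hypothetical scores: confidence collapses at the third step.
step_labels = annotate_steps([0.80, 0.82, 0.30, 0.28])
```

Because the labels depend only on changes between consecutive scores, no ground-truth answer is needed at annotation time, which matches the motivation given in the abstract.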
Improving Simple Models with Confidence Profiles
In this paper, we propose a new method called ProfWeight for transferring information from a pre-trained deep neural network that has a high test accuracy to a simpler interpretable model or a very shallow network of low complexity and a priori low test accuracy. We are motivated by applications in interpretability and model deployment in severely memory-constrained environments (like sensors). Our method uses linear probes to generate confidence scores through flattened intermediate representations. Our transfer method involves a theoretically justified weighting of samples during the training of the simple model using confidence scores of these intermediate layers. The value of our method is first demonstrated on CIFAR-10, where our weighting method significantly improves (by 3-4%) networks with only a fraction of the number of ResNet blocks of a complex ResNet model. We further demonstrate operationally significant results on a real manufacturing problem, where we dramatically increase the test accuracy of a CART model (the domain standard) by roughly 13%.
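The sample weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it takes the probe outputs as given, averages the true-label confidence over all probes, and plugs the result into a weighted negative log-likelihood for the simple model. The function names and the choice to average over every probe are hypothetical.

```python
import numpy as np

def profile_weights(probe_probs, labels):
    """Per-sample weight = mean confidence in the true label across probes.

    probe_probs: (num_probes, num_samples, num_classes) probabilities from
                 linear probes attached to intermediate layers.
    labels:      (num_samples,) integer class labels.
    """
    probe_probs = np.asarray(probe_probs, dtype=float)
    labels = np.asarray(labels)
    # Pick each probe's probability for each sample's true class.
    true_conf = probe_probs[:, np.arange(labels.size), labels]
    return true_conf.mean(axis=0)

def weighted_nll(simple_probs, labels, weights):
    """Weighted negative log-likelihood for training the simple model."""
    simple_probs = np.asarray(simple_probs, dtype=float)
    labels = np.asarray(labels)
    picked = simple_probs[np.arange(labels.size), labels]
    return float(np.sum(weights * -np.log(picked)) / np.sum(weights))

# Hypothetical data: two probes, two samples, two classes.
probes = [[[0.9, 0.1], [0.4, 0.6]],
          [[0.7, 0.3], [0.2, 0.8]]]
weights = profile_weights(probes, labels=[0, 0])
```

Samples that the deep network's intermediate layers already classify confidently receive larger weights, so the simple model spends its limited capacity on examples it has a realistic chance of fitting.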