tnr
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California (0.04)
- North America > Canada (0.04)
- (4 more...)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.94)
- Government (1.00)
- Health & Medicine (0.93)
- Law (0.68)
mhealth_ood_neurips_2021.pdf
In this section, we provide screenshots and list of examples that were used in the user study. Note that the name of the institution is redacted for the review. This shows an example interface for skin cancer classifier. Figure 4: Interface to display different input data types. Figure 5: List of input examples used in the user study.
Variation in Verification: Understanding Verification Dynamics in Large Language Models
Zhou, Yefan, Xu, Austin, Zhou, Yilun, Singh, Janvijay, Gui, Jiang, Joty, Shafiq
Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions - problem difficulty, generator capability, and verifier generation capability - with empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities to optimize basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.5%). Second, we identify cases where strong verifiers offer limited advantage over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
- North America > United States > Virginia (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- (4 more...)
Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
Isaku, Erblin, Sartaj, Hassan, Ali, Shaukat, Sanguino, Beatriz, Wang, Tongtong, Li, Guoyuan, Zhang, Houxiang, Peyrucain, Thomas
Self-adaptive robots (SARs) in complex, uncertain environments must proactively detect and address abnormal behaviors, including out-of-distribution (OOD) cases. To this end, digital twins offer a valuable solution for OOD detection. Thus, we present a digital twin-based approach for OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to forecast SAR states and employs reconstruction error and Monte Carlo dropout for uncertainty quantification. By combining reconstruction error with predictive variance, the digital twin effectively detects OOD behaviors, even in previously unseen conditions. The digital twin also includes an explainability layer that links potential OOD to specific SAR states, offering insights for self-adaptation. We evaluated ODiSAR by creating digital twins of two industrial robots: one navigating an office environment, and another performing maritime ship navigation. In both cases, ODiSAR forecasts SAR behaviors (i.e., robot trajectories and vessel motion) and proactively detects OOD events. Our results showed that ODiSAR achieved high detection performance -- up to 98\% AUROC, 96\% TNR@TPR95, and 95\% F1-score -- while providing interpretable insights to support self-adaptation.
- Health & Medicine (0.96)
- Energy > Renewable > Wind (0.46)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California (0.04)
- (6 more...)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.94)
- Government (1.00)
- Health & Medicine (0.93)
- Banking & Finance (0.68)
- Law (0.68)
Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification
Evaluation Metrics is an important question for model evaluation and model selection in binary classification tasks. This study investigates how consistent metrics are at evaluating different models under different data scenarios. Analyzing over 150 data scenarios and 18 model evaluation metrics using statistical simulation, I find that for binary classification tasks, evaluation metrics that are less influenced by prevalence offer more consistent ranking of a set of different models. In particular, Area Under the ROC Curve (AUC) has smallest variance in ranking of different models. Matthew's correlation coefficient as a more strict measure of model performance has the second smallest variance. These patterns holds across a rich set of data scenarios and five commonly used machine learning models as well as a naive random guess model. The results have significant implications for model evaluation and model selection in binary classification tasks.
- North America > United States > Illinois (0.04)
- North America > United States > Florida > Broward County (0.04)
- Europe > Portugal (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
Predicting Overtakes in Trucks Using CAN Data
Butt, Talha Hanif, Tiwari, Prayag, Alonso-Fernandez, Fernando
Safe overtakes in trucks are crucial to prevent accidents, reduce congestion, and ensure efficient traffic flow, making early prediction essential for timely and informed driving decisions. Accordingly, we investigate the detection of truck overtakes from CAN data. Three classifiers, Artificial Neural Networks (ANN), Random Forest, and Support Vector Machines (SVM), are employed for the task. Our analysis covers up to 10 seconds before the overtaking event, using an overlapping sliding window of 1 second to extract CAN features. We observe that the prediction scores of the overtake class tend to increase as we approach the overtake trigger, while the no-overtake class remain stable or oscillates depending on the classifier. Thus, the best accuracy is achieved when approaching the trigger, making early overtaking prediction challenging. The classifiers show good accuracy in classifying overtakes (Recall/TPR > 93%), but accuracy is suboptimal in classifying no-overtakes (TNR typically 80-90% and below 60% for one SVM variant). We further combine two classifiers (Random Forest and linear SVM) by averaging their output scores. The fusion is observed to improve no-overtake classification (TNR > 92%) at the expense of reducing overtake accuracy (TPR). However, the latter is kept above 91% near the overtake trigger. Therefore, the fusion balances TPR and TNR, providing more consistent performance than individual classifiers.
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- Europe > Sweden > Halland County > Halmstad (0.04)
On support vector machines under a multiple-cost scenario
Benítez-Peña, Sandra, Blanquero, Rafael, Carrizosa, Emilio, Ramírez-Cobo, Pepa
Support Vector Machine (SVM) is a powerful tool in binary classification, known to attain excellent misclassification rates. On the other hand, many realworld classification problems, such as those found in medical diagnosis, churn or fraud prediction, involve misclassification costs which may be different in the different classes. However, it may be hard for the user to provide precise values for such misclassification costs, whereas it may be much easier to identify acceptable misclassification rates values. In this paper we propose a novel SVM model in which misclassification costs are considered by incorporating performance constraints in the problem formulation. Specifically, our aim is to seek the hyperplane with maximal margin yielding misclassification rates below given threshold values. Such maximal margin hyperplane is obtained by solving a quadratic convex problem with linear constraints and integer variables. The reported numerical experience shows that our model gives the user control on the misclassification rates in one class (possibly at the expense of an increase in misclassification rates for the other class) and is feasible in terms of running times.
- North America > United States > Wisconsin (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Spain > Andalusia > Seville Province > Seville (0.04)
- (6 more...)