model evaluation


Automatic Unsupervised Outlier Model Selection

Neural Information Processing Systems

Given an unsupervised outlier detection task on a new dataset, how can we automatically select a good outlier detection algorithm and its hyperparameter(s) (collectively called a model)? In this work, we tackle the unsupervised outlier model selection (UOMS) problem, and propose METAOD, a principled, data-driven approach to UOMS based on meta-learning. The UOMS problem is notoriously challenging, as compared to model selection for classification and clustering, since (i) model evaluation is infeasible due to the lack of hold-out data with labels, and (ii) model comparison is infeasible due to the lack of a universal objective function. METAOD capitalizes on the performances of a large body of detection models on historical outlier detection benchmark datasets, and carries over this prior experience to automatically select an effective model to be employed on a new dataset without any labels, model evaluations or model comparisons. To capture task similarity within our meta-learning framework, we introduce specialized metafeatures that quantify outlying characteristics of a dataset. Extensive experiments show that selecting a model by METAOD significantly outperforms no model selection (e.g.


the Hamiltonian bound

Neural Information Processing Systems

Algorithm 6 Generating the (non-differentiable) Hamiltonian AIS variational bound. Figure 1 shows the results. The first row shows the results obtained by tuning the pair (,η) and each other parameter individually for different values of K, and the second row shows the results obtained by tuning increasingly more parameters. It can be observed that tuning β and q(z) leads to the largest gains in performance. Figure 4: Tuning more parameters leads to significantly better results.


[Figure residue: MAE vs Oracle; panels Acc and ECE; groups 25-35, 35-50, 50-65, 65-80]

Neural Information Processing Systems

Evaluating the performance of machine learning models on diverse and underrepresented subgroups is essential for ensuring fairness and reliability in real-world applications. However, accurately assessing model performance becomes challenging due to two main issues: (1) a scarcity of test data, especially for small subgroups, and (2) possible distributional shifts in the model's deployment setting, which may not align with the available test data. In this work, we introduce 3STesting, a deep generative modeling framework to facilitate model evaluation by generating synthetic test sets for small subgroups and simulating distributional shifts. Our experiments demonstrate that 3STesting outperforms traditional baselines, including real test data alone, in estimating model performance on minority subgroups and under plausible distributional shifts. In addition, 3S offers intervals around its performance estimates, exhibiting superior coverage of the ground truth compared to existing approaches. Overall, these results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.


Weak Supervision Performance Evaluation via Partial Identification

Neural Information Processing Systems

Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels, utilizing weak labels from heuristics, crowdsourcing, or pre-trained models. However, the absence of ground truth complicates model evaluation, as traditional metrics such as accuracy, precision, and recall cannot be directly calculated. In this work, we present a novel method to address this challenge by framing model evaluation as a partial identification problem and estimating performance bounds using Fréchet bounds. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques. Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications.
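To make the idea of bounding a metric without labels concrete, here is a minimal sketch of the classical Fréchet bounds applied to binary accuracy. It is an illustration under simplifying assumptions, not the method from the abstract above: it assumes only two marginals are known (the classifier's positive-prediction rate and the true positive-class rate), and it does not use weak labels or convex optimization. The function names `frechet_bounds` and `accuracy_bounds` are hypothetical.

```python
def frechet_bounds(p_a: float, p_b: float) -> tuple[float, float]:
    """Fréchet bounds on P(A and B) when only P(A) and P(B) are known."""
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper


def accuracy_bounds(p_pred_pos: float, p_true_pos: float) -> tuple[float, float]:
    """Bounds on binary accuracy from marginals alone (illustrative assumption).

    With p = P(pred = 1), q = P(y = 1) and t = P(pred = 1, y = 1):
        accuracy = P(pred = 1, y = 1) + P(pred = 0, y = 0)
                 = t + (1 - p - q + t)
                 = 1 - p - q + 2t,
    so bounding t with the Fréchet bounds bounds the accuracy.
    """
    joint_lo, joint_hi = frechet_bounds(p_pred_pos, p_true_pos)
    base = 1.0 - p_pred_pos - p_true_pos
    return base + 2.0 * joint_lo, base + 2.0 * joint_hi


if __name__ == "__main__":
    lo, hi = accuracy_bounds(p_pred_pos=0.3, p_true_pos=0.4)
    print(f"accuracy lies in [{lo:.3f}, {hi:.3f}]")
```

With a positive-prediction rate of 0.3 and a positive-class rate of 0.4, the accuracy is only pinned down to roughly the interval from 0.3 to 0.9. Marginals alone give wide bounds, which is why extra information such as weak labels is needed to tighten them.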





GNNEvaluator: Evaluating GNN Performance On Unseen Graphs Without Labels

Neural Information Processing Systems

The DiscGraph set captures wide-ranging and diverse graph data distribution discrepancies through a discrepancy measurement function that exploits the GNN outputs related to latent node embeddings and node class predictions.