Promises and Pitfalls of Threshold-based Auto-labeling
Creating large-scale, high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promises and pitfalls of using such systems.
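To make the workflow concrete, here is a minimal sketch of the threshold-selection step the abstract describes, with hypothetical function and variable names of our own (`tbal_threshold`, `target_err`, and so on); the paper's actual algorithm and its guarantees are more involved.

```python
import numpy as np

def tbal_threshold(val_conf, val_correct, target_err=0.05):
    """Return the smallest confidence threshold at which the model's error on
    human-labeled validation data drops to target_err or below.

    val_conf    -- model confidence for each validation point
    val_correct -- 1 if the model's label matched the human label, else 0
    """
    for t in np.unique(val_conf):                 # candidate thresholds, ascending
        covered = val_conf >= t
        err = 1.0 - val_correct[covered].mean()
        if err <= target_err:
            return t                              # smallest t meeting the error target
    return np.inf                                 # no threshold qualifies: auto-label nothing

def auto_label(unl_conf, unl_pred, threshold):
    """Machine-label only the unlabeled points whose confidence clears the threshold."""
    keep = unl_conf >= threshold
    return unl_pred[keep], keep

# Toy usage: synthetic confidences where higher confidence means more often correct.
rng = np.random.default_rng(0)
val_conf = rng.uniform(0.5, 1.0, size=200)
val_correct = (rng.uniform(size=200) < val_conf).astype(int)
print("chosen threshold:", tbal_threshold(val_conf, val_correct))
```

The validation set does double duty here (estimating the error at every candidate threshold), which is exactly why the paper's bounds on validation-data usage matter.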
The promises and pitfalls of Stochastic Gradient Langevin Dynamics
Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large-scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated spectacular successes in machine learning tasks. The current practice is to set the step size inversely proportional to N, where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm, for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by limited numerical experiments.
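For reference, a constant-step-size SGLD update of the kind analyzed here can be sketched as follows. The toy Gaussian model and all names are ours, and the step size is set to 1/N as in the practice the abstract describes.

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, data, step, n_iter, batch=32, seed=0):
    """Constant-step-size SGLD for a 1-D parameter (a minimal sketch).

    grad_log_lik(theta, x) returns per-datum gradients of log p(x | theta).
    The minibatch gradient is rescaled by N/batch so it stays unbiased.
    """
    rng = np.random.default_rng(seed)
    N = len(data)
    theta, samples = 0.0, []
    for _ in range(n_iter):
        idx = rng.integers(0, N, size=batch)
        grad = grad_log_prior(theta) + (N / batch) * grad_log_lik(theta, data[idx]).sum()
        theta += 0.5 * step * grad + np.sqrt(step) * rng.standard_normal()
        samples.append(theta)
    return np.array(samples)

# Toy model: x_i ~ N(theta, 1) with a N(0, 10) prior; step ~ 1/N as in practice.
data = np.random.default_rng(1).normal(1.0, 1.0, size=1000)
samples = sgld(lambda th: -th / 10.0,          # gradient of the Gaussian log prior
               lambda th, x: x - th,           # per-datum gradient of the Gaussian log-lik
               data, step=1.0 / len(data), n_iter=5000)
print("posterior mean estimate:", samples[1000:].mean())
```

The rescaled minibatch term is the source of the high gradient variance the abstract refers to; the control-variate fix is sketched further below, after the review of this paper.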
Response to Promises and Pitfalls of Deep Kernel Learning
Wilson, Andrew Gordon, Hu, Zhiting, Salakhutdinov, Ruslan, Xing, Eric P.
This note responds to "Promises and Pitfalls of Deep Kernel Learning" (Ober et al., 2021). The marginal likelihood of a Gaussian process can be compartmentalized into a data fit term and a complexity penalty. Ober et al. (2021) show that if a kernel can be multiplied by a signal variance coefficient, then reparametrizing and substituting in the maximized value of this parameter sets a reparametrized data fit term to a fixed value. They use this finding to argue that the complexity penalty, a log determinant of the kernel matrix, then dominates in determining the values of the other kernel hyperparameters, which can lead to data overcorrelation. By contrast, we show that the reparametrization in fact introduces another data-fit term which influences all other kernel hyperparameters. Thus, a balance between data fit and complexity still plays a significant role in determining kernel hyperparameters.
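The decomposition at issue is easy to check numerically. Below is a minimal sketch (a toy setup of our own, not code from either paper) showing that, for a kernel scaled as sigma^2 * K, substituting the maximizing signal variance sigma^2 = y^T K^{-1} y / n pins the data-fit term at -n/2, which is the observation Ober et al. (2021) start from.

```python
import numpy as np

def log_marginal_likelihood(K, y):
    """Split the GP log marginal likelihood into data-fit and complexity terms."""
    n = len(y)
    alpha = np.linalg.solve(K, y)
    data_fit = -0.5 * y @ alpha                    # -1/2 * y^T K^{-1} y
    sign, logdet = np.linalg.slogdet(K)
    complexity = -0.5 * logdet                     # -1/2 * log |K|
    return data_fit, complexity, data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-6 * np.eye(40)   # RBF kernel plus jitter
y = rng.multivariate_normal(np.zeros(40), K)

# Closed-form maximizer of the signal variance for the kernel sigma2 * K.
sigma2_hat = (y @ np.linalg.solve(K, y)) / len(y)
fit, comp, _ = log_marginal_likelihood(sigma2_hat * K, y)
print("data fit at optimal signal variance:", fit, "(should equal -n/2 =", -len(y) / 2, ")")
```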
Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
Anzenberg, Eitan, Samajpati, Arunava, Chandrasekar, Sivasankaran, Kacholia, Varun
The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs - including models from OpenAI, Anthropic, Google, Meta, and Deepseek - and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model's predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs (0.906 vs 0.773, respectively, for intersectional subgroups). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there need not be a dichotomy between accuracy and fairness in hiring: a well-designed algorithm can achieve both accurate hiring and fair outcomes.
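For readers unfamiliar with the fairness metric, an impact ratio from a cut-off analysis can be computed as below. This is a generic sketch with made-up data and names of our own choosing, not the authors' evaluation code.

```python
import numpy as np

def impact_ratio(scores, groups, cutoff):
    """Minimum group-wise selection rate divided by the maximum, at a score cutoff.

    A value of 1.0 is parity; values below 0.8 fail the common 'four-fifths' rule
    used in employment-discrimination analysis.
    """
    selected = scores >= cutoff
    rates = [selected[groups == g].mean() for g in np.unique(groups)]
    return min(rates) / max(rates)

# Hypothetical match scores for two declared groups.
rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)
groups = rng.choice(["A", "B"], size=1000)
print("impact ratio at cutoff 0.5:", round(impact_ratio(scores, groups, 0.5), 3))
```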
Reviews: The promises and pitfalls of Stochastic Gradient Langevin Dynamics
Review after rebuttal: I thank the author(s) for their response. While I still believe that this paper is a minor increment beyond what has already been done on SGLD, I agree that the message might be useful for some. I also appreciate the effort the authors have made in improving the manuscript based on reviewers' suggestions, particularly their efforts to include numerical experiments relevant to ML scenarios, and recommendations beyond the CV approach, which has been studied to exhaustion and is rarely applicable in practice. Based on this, I've adjusted my decision to marginally above threshold. Original review: In the paper "The promises and pitfalls of Stochastic Gradient Langevin Dynamics" the authors revisit the Stochastic Gradient Langevin Dynamics (SGLD) approach to approximately sampling from a probability distribution using stochastic gradients (specifically, subsampling). The authors compare a number of different classes of approximate inference methods, including SGLD, LMC (known by some as the Unadjusted Langevin Algorithm, or ULA) and Stochastic Gradient Langevin Dynamics Fixed Point (SGLDFP) -- the latter being a variant of SGLD with a control variate exploiting the unimodality of the distribution, similar to what has been presented in [3, 25 and others].
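The control-variate construction the review refers to can be sketched as follows: center the minibatch gradient at a precomputed mode estimate, so the noise shrinks as the chain stays near the mode. The toy model (Gaussian likelihood, flat prior for simplicity) and all names are ours, not the paper's notation.

```python
import numpy as np

def grad_i(theta, x):
    """Per-datum gradient of log p(x | theta) for x_i ~ N(theta, 1)."""
    return x - theta

def sgldfp(data, theta_star, step, n_iter, batch=32, seed=0):
    """SGLDFP-style sampler: minibatch gradient centered at the mode theta_star."""
    rng = np.random.default_rng(seed)
    N = len(data)
    g_star = grad_i(theta_star, data)        # per-datum gradients at the mode, cached once
    theta, samples = theta_star, []
    for _ in range(n_iter):
        idx = rng.integers(0, N, size=batch)
        # Unbiased: the subtracted term has expectation equal to the full gradient
        # at theta_star, which is zero at the mode. In this Gaussian toy the
        # per-datum differences are identical, so the noise cancels exactly; in
        # general the variance shrinks like (theta - theta_star)^2 near the mode.
        grad = (N / batch) * (grad_i(theta, data[idx]) - g_star[idx]).sum()
        theta += 0.5 * step * grad + np.sqrt(step) * rng.standard_normal()
        samples.append(theta)
    return np.array(samples)

data = np.random.default_rng(1).normal(1.0, 1.0, size=1000)
theta_star = data.mean()                      # MLE = posterior mode under a flat prior
samples = sgldfp(data, theta_star, step=1.0 / len(data), n_iter=5000)
print("posterior std estimate:", samples[1000:].std())   # true value is 1/sqrt(N)
```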
Promises and pitfalls of artificial intelligence for legal applications
Kapoor, Sayash, Henderson, Peter, Narayanan, Arvind
Is AI set to redefine the legal profession? We argue that this claim is not supported by the current evidence. We dive into AI's increasingly prevalent roles in three types of legal tasks: information processing; tasks involving creativity, reasoning, or judgment; and predictions about the future. We find that the ease of evaluating legal applications varies greatly across legal tasks, based on the ease of identifying correct answers and the observability of information relevant to the task at hand. Tasks that would lead to the most significant changes to the legal profession are also the ones most prone to overoptimism about AI capabilities, as they are harder to evaluate. We make recommendations for better evaluation and deployment of AI in legal contexts.
Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation
Khan, Muhammad Fawad Akbar, Ramsdell, Max, Falor, Erik, Karimi, Hamid
This paper presents a comprehensive evaluation of the code generation capabilities of ChatGPT, a prominent large language model, compared to human programmers. A novel dataset of 131 code-generation prompts across 5 categories was curated to enable robust analysis. Code solutions were generated by both ChatGPT and humans for all prompts, resulting in 262 code samples. A meticulous manual assessment methodology prioritized evaluating correctness, comprehensibility, and security using 14 established code quality metrics. The key findings reveal ChatGPT's strengths in crafting concise, efficient code with advanced constructs, with particular strength in data analysis tasks (93.1% accuracy) but limitations in visual-graphical challenges. Comparative analysis with human code highlights ChatGPT's inclination towards modular design and superior error handling. Additionally, machine learning models effectively distinguished ChatGPT from human code with up to 88% accuracy, suggesting detectable coding style disparities. By providing profound insights into ChatGPT's code generation capabilities and limitations through quantitative metrics and qualitative analysis, this study makes valuable contributions toward advancing AI-based programming assistants. The curated dataset and methodology offer a robust foundation for future research in this nascent domain. All data and code are available at https://github.com/DSAatUSU/ChatGPT-promises-and-pitfalls.
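The paper's specific features and classifiers are not reproduced here, but the general recipe for distinguishing model-written from human-written code can be sketched with off-the-shelf tools; the toy corpus below is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical corpus: code strings paired with author labels (1 = ChatGPT, 0 = human).
samples = [
    "def mean(xs):\n    return sum(xs) / len(xs)",
    "def mean(xs):\n    s = 0\n    for x in xs: s += x\n    return s / len(xs)",
] * 20
labels = [1, 0] * 20

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams capture style
    LogisticRegression(max_iter=1000),
)
print("CV accuracy:", cross_val_score(clf, samples, labels, cv=5).mean())
```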
Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization
Kristiadi, Agustinus, Immer, Alexander, Eschenhagen, Runa, Fortuin, Vincent
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks. It is theoretically compelling since it can be seen as a Gaussian process posterior with the mean function given by the neural network's maximum-a-posteriori predictive function and the covariance function induced by the empirical neural tangent kernel. However, while its efficacy has been studied in large-scale tasks like image classification, it has not been studied in sequential decision-making problems like Bayesian optimization where Gaussian processes -- with simple mean functions and kernels such as the radial basis function -- are the de facto surrogate models. In this work, we study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility. However, we also present some pitfalls that might arise and a potential problem with the LLA when the search space is unbounded.
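A sketch of the LLA predictive the abstract describes, using fixed random features in place of a network Jacobian so that the linearization is exact and easy to verify; for a real network, J would be the Jacobian of the output with respect to the weights at the MAP estimate. All names here are ours. How the predictive variance behaves far from the data is exactly what matters for the unbounded-search-space pitfall the abstract mentions.

```python
import numpy as np

def lla_predict(J_train, y, J_test, f_test_map, prior_prec=1.0, noise_var=0.1):
    """GP-style predictive: mean from the MAP function, covariance from the
    'tangent' features J under a GGN/Laplace approximation."""
    d = J_train.shape[1]
    H = J_train.T @ J_train / noise_var + prior_prec * np.eye(d)  # posterior precision
    cov = J_test @ np.linalg.solve(H, J_test.T)                   # predictive covariance
    return f_test_map, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
W = rng.normal(size=(1, 64))                  # random feature projection
phi = lambda x: np.tanh(x @ W)                # features = Jacobian of a linear-in-params model
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
w_map = np.linalg.solve(phi(X).T @ phi(X) + 0.1 * np.eye(64), phi(X).T @ y)

X_test = np.linspace(-6, 6, 5).reshape(-1, 1)  # includes points outside the data range
mean, std = lla_predict(phi(X), y, phi(X_test), phi(X_test) @ w_map)
print(np.round(std, 3))                        # predictive std, near and far from the data
```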
Three experts discuss Midjourney's promise and pitfalls
This summer, text-to-image AIs have captured the imagination of architects. The software is a powerful tool, but one that should be integrated into ongoing discussions of architectural image making, technology, representation, bias, education, and labor. AN gathered Kory Bieg, Shelby Doyle, and Andrew Kudless to discuss these issues. The Architect's Newspaper: To start, could you share how you've been using Midjourney and related AI platforms so far? What kinds of explorations have you done? What types of images have you been making? So far, it's been for open exploration. I'm trying to understand how to communicate with AI. On one hand, you can write a text and hope to get something that's related to the text.