Goto

Collaborating Authors

 validation study


A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

arXiv.org Artificial Intelligence

Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically re-quired to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study us-ing two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of find-ings derived from database studies.


Shall Your Data Strategy Work? Perform a Swift Study

arXiv.org Artificial Intelligence

This work presents a swift method to assess the efficacy of particular types of instruction-tuning data, utilizing just a handful of probe examples and eliminating the need for model retraining. This method employs the idea of gradient-based data influence estimation, analyzing the gradient projections of probe examples from the chosen strategy onto evaluation examples to assess its advantages. Building upon this method, we conducted three swift studies to investigate the potential of Chain-of-thought (CoT) data, query clarification data, and response evaluation data in enhancing model generalization. Subsequently, we embarked on a validation study to corroborate the findings of these swift studies. In this validation study, we developed training datasets tailored to each studied strategy and compared model performance with and without the use of these datasets. The results of the validation study aligned with the findings of the swift studies, validating the efficacy of our proposed method.


A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions

arXiv.org Artificial Intelligence

Prediction models inform important clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. However, changes in the distribution of the data impact model performance. In health-care, a typical change is a shift in case-mix: for example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital. This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not. A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. This framework provides critical insights for evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task.


Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved - Journal of Clinical Epidemiology

#artificialintelligence

Evaluate the completeness of reporting of prognostic prediction models developed using machine learning methods in the field of oncology. We conducted a systematic review, searching the MEDLINE and Embase databases between 01/01/2019 and 05/09/2019, for non-imaging studies developing a prognostic clinical prediction model using machine learning methods (as defined by primary study authors) in oncology. We used the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement to assess the reporting quality of included publications. We described overall reporting adherence of included publications and by each section of TRIPOD. Sixty-two publications met the inclusion criteria.


Adversarial Scrutiny of Evidentiary Statistical Software

arXiv.org Artificial Intelligence

The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software -- such as probabilistic genotyping, environmental audio detection, and toolmark analysis tools -- that defense counsel cannot fully cross-examine or scrutinize. This undermines the commitments of the adversarial criminal legal system, which relies on the defense's ability to probe and test the prosecution's case to safeguard individual rights. Responding to this need to adversarially scrutinize output from such software, we propose robust adversarial testing as an audit framework to examine the validity of evidentiary statistical software. We define and operationalize this notion of robust adversarial testing for defense use by drawing on a large body of recent work in robust machine learning and algorithmic fairness. We demonstrate how this framework both standardizes the process for scrutinizing such tools and empowers defense lawyers to examine their validity for instances most relevant to the case at hand. We further discuss existing structural and institutional challenges within the U.S. criminal legal system that may create barriers for implementing this and other such audit frameworks and close with a discussion on policy changes that could help address these concerns.


Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved

#artificialintelligence

Evaluate the completeness of reporting of prognostic prediction models developed using machine learning methods in the field of oncology. We conducted a systematic review, searching the MEDLINE and Embase databases between 01/01/2019 and 05/09/2019, for non-imaging studies developing a prognostic clinical prediction model using machine learning methods (as defined by primary study authors) in oncology. We used the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement to assess the reporting quality of included publications. We described overall reporting adherence of included publications and by each section of TRIPOD. Sixty-two publications met the inclusion criteria.


Detecting Spurious Correlations with Sanity Tests for Artificial Intelligence Guided Radiology Systems

arXiv.org Machine Learning

Artificial intelligence (AI) has been successful at solving numerous problems in machine perception. In radiology, AI systems are rapidly evolving and show progress in guiding treatment decisions, diagnosing, localizing disease on medical images, and improving radiologists' efficiency. A critical component to deploying AI in radiology is to gain confidence in a developed system's efficacy and safety. The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset from one or more institutions, followed by a clinical validation study of the system's efficacy during deployment. Clinical validation studies are time-consuming, and best practices dictate limited re-use of analytical validation data, so it is ideal to know ahead of time if a system is likely to fail analytical or clinical validation. In this paper, we describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons. We illustrate the sanity tests' value by designing a deep learning system to classify pancreatic cancer seen in computed tomography scans.


Measurement Error in Nutritional Epidemiology: A Survey

arXiv.org Machine Learning

This article reviews bias-correction models for measurement error of exposure variables in the field of nutritional epidemiology. Measurement error usually attenuates estimated slope towards zero. Due to the influence of measurement error, inference of parameter estimate is conservative and confidence interval of the slope parameter is too narrow. Bias-correction in estimators and confidence intervals are of primary interest. We review the following bias-correction models: regression calibration methods, likelihood based models, missing data models, simulation based methods, nonparametric models and sampling based procedures.


What Can We Expect Following Anterior Total Hip Arthroplasty on a Regular Operating Table? A Validation Study of an Artificial Intelligence Algorithm to Monitor Adverse Events in a High-Volume, Nonacademic Setting

#artificialintelligence

Quality monitoring is increasingly important to support and assure sustainability of the orthopedic practice. Surgeons in nonacademic settings often lack resources to accurately monitor quality of care. Widespread use of electronic medical records (EMR) provides easier access to medical information, facilitating its analysis. However, manual review of EMRs is highly inefficient. Artificial intelligence (AI) software allows for the development of algorithms for extracting relevant complications from EMRs.


Semiparametric Methods for Exposure Misclassification in Propensity Score-Based Time-to-Event Data Analysis

arXiv.org Machine Learning

In epidemiology, identifying the effect of exposure variables in relation to a time-to-event outcome is a classical research area of practical importance. Incorporating propensity score in the Cox regression model, as a measure to control for confounding, has certain advantages when outcome is rare. However, in situations involving exposure measured with moderate to substantial error, identifying the exposure effect using propensity score in Cox models remains a challenging yet unresolved problem. In this paper, we propose an estimating equation method to correct for the exposure misclassification-caused bias in the estimation of exposure-outcome associations. We also discuss the asymptotic properties and derive the asymptotic variances of the proposed estimators. We conduct a simulation study to evaluate the performance of the proposed estimators in various settings. As an illustration, we apply our method to correct for the misclassification-caused bias in estimating the association of PM2.5 level with lung cancer mortality using a nationwide prospective cohort, the Nurses' Health Study (NHS). The proposed methodology can be applied using our user-friendly R function published online.