Performance Analysis
Efficient Multiple Incremental Computation for Kernel Ridge Regression with Bayesian Uncertainty Modeling
Chen, Bo-Wei, Abdullah, Nik Nailah Binti, Park, Sangoh
This study presents an efficient incremental/decremental approach for big streams based on Kernel Ridge Regression (KRR), a frequently used data analysis method in cloud centers. To avoid reanalyzing the whole dataset whenever sensors receive new training data, typical incremental KRR used a single-instance mechanism for updating an existing system. However, this inevitably increased redundant computational time and limited applicability to big streams. To this end, the proposed mechanism supports incremental/decremental processing for both single and multiple samples (i.e., batch processing). A large volume of data can be divided into batches and processed by a machine without sacrificing accuracy. Moreover, incremental/decremental analyses in empirical and intrinsic space are also proposed in this study to handle different types of data, either with a large number of samples or with high feature dimensions, whereas typical methods focused on only one type. At the end of this study, we further extend the proposed mechanism to statistical Kernelized Bayesian Regression, so that uncertainty modeling with incremental/decremental computation becomes applicable. Experimental results showed that computational time was significantly reduced compared with both the original nonincremental design and the typical single-instance incremental method. Furthermore, the accuracy of the proposed method remained the same as that of the baselines, which implies that the system improved efficiency without sacrificing accuracy. These findings indicate that the proposed method is appropriate for variable streaming data analysis.
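The batch (multiple-sample) incremental idea in empirical space can be illustrated with a block matrix inverse update. The following is a minimal sketch under stated assumptions, not the authors' implementation: the RBF kernel, the names rbf_kernel and add_batch, and the parameters rho and gamma are all hypothetical choices made for illustration. It updates the inverse of (K + rho*I) when a batch of new samples arrives, instead of inverting the enlarged matrix from scratch.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def add_batch(A_inv, X_old, X_new, rho=1.0, gamma=0.5):
    """Update inv(K + rho*I) after appending a batch X_new (block inverse).

    A_inv is the current inverse for X_old. Only a small m x m system
    (the Schur complement S) is inverted, where m = len(X_new).
    """
    B = rbf_kernel(X_old, X_new, gamma)                      # n x m cross-kernel
    C = rbf_kernel(X_new, X_new, gamma) + rho * np.eye(len(X_new))
    AB = A_inv @ B                                           # n x m
    S_inv = np.linalg.inv(C - B.T @ AB)                      # inverse Schur complement
    top_left = A_inv + AB @ S_inv @ AB.T                     # A_inv is symmetric
    top_right = -AB @ S_inv
    return np.block([[top_left, top_right], [top_right.T, S_inv]])

# Synthetic data (an assumption for this sketch): 20 old samples, 5 new ones.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(20, 3))
X_new = rng.normal(size=(5, 3))
rho = 1.0

A_inv = np.linalg.inv(rbf_kernel(X_old, X_old) + rho * np.eye(20))
updated = add_batch(A_inv, X_old, X_new, rho=rho)

# Reference: recompute the inverse on the full dataset.
X_all = np.vstack([X_old, X_new])
full = np.linalg.inv(rbf_kernel(X_all, X_all) + rho * np.eye(25))
print(np.allclose(updated, full))
```

Given the updated inverse, the KRR coefficients in empirical space follow as the product of that inverse with the stacked target vector. A decremental step can be derived from the same block identity by solving for the top-left block.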
What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)
Agrawal, Amritanshu, Fu, Wei, Menzies, Tim
Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet Allocation (LDA). When run on different datasets, LDA suffers from "order effects", i.e., different topics are generated if the order of training data is shuffled. Such order effects introduce a systematic error for any study. This error can lead to misleading results; specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results. Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis. Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stackoverflow), title and abstract text of thousands of Software Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark); across different platforms (Linux, Macintosh) and for different kinds of LDAs (VEM, or using Gibbs sampling). Results were scored via topic stability and text mining classification accuracy. Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE's tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performances for supervised as well as unsupervised learning. Conclusion: Due to topic instability, using standard LDA with its "off-the-shelf" settings should now be deprecated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.
End-to-End Abnormality Detection in Medical Imaging
Wu, Dufan, Kim, Kyungsang, Dong, Bin, Li, Quanzheng
Nearly all deep learning based image analysis methods work on reconstructed images, which are obtained from original acquisitions by solving inverse problems. The reconstruction algorithms are designed for human observers, but not necessarily optimized for DNNs. It is desirable to train the DNNs directly on the original data, which lie in a different domain from the images. In this work, we proposed an end-to-end DNN for abnormality detection in medical imaging. A DNN was built as the unrolled version of iterative reconstruction algorithms to map the acquisitions to images, followed by a 3D convolutional neural network (CNN) to detect the abnormality in the reconstructed images. The two networks were trained jointly in order to optimize the entire DNN for the detection task from the original acquisitions. The DNN was implemented for lung nodule detection in low-dose chest CT. The proposed end-to-end DNN demonstrated better sensitivity and accuracy for the task compared to a two-step approach, in which the reconstruction and detection DNNs were trained separately. A significant reduction of the false positive rate on suspicious lesions was observed, which is crucial given the known over-diagnosis in low-dose lung CT imaging. The images reconstructed by the proposed end-to-end network also presented enhanced details in the region of interest.
Trimmed Density Ratio Estimation
Liu, Song, Takeda, Akiko, Suzuki, Taiji, Fukumizu, Kenji
Density ratio estimation (DRE) [18, 11, 27] is an important tool in various branches of machine learning and statistics. Due to its ability to directly model the differences between two probability density functions, DRE finds applications in change detection [13, 6], two-sample tests [32] and outlier detection [1, 26]. In recent years, a sampling framework called Generative Adversarial Network (GAN) (see e.g., [9, 19]) uses the density ratio function to compare artificial samples from a generative distribution and real samples from an unknown distribution. DRE has also been widely discussed in the statistical literature for adjusting nonparametric density estimation [5], stabilizing the estimation of heavy-tailed distributions [7] and fitting multiple distributions at once [8]. However, as a density ratio function can grow unbounded, DRE can suffer from robustness and stability issues: a few corrupted points may completely mislead the estimator (see Figure 2 in Section 6 for example).
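To make the "unbounded ratio" point concrete, here is a small illustrative sketch. It is not the trimmed estimator of the paper; it only evaluates a true density ratio for two zero-mean Gaussians whose standard deviations (1.0 and 0.5) are assumptions chosen for this example. Because the denominator density has lighter tails, the ratio grows without bound away from the origin, so a single point far in the tail carries an enormous weight.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def density_ratio(x, std_p=1.0, std_q=0.5):
    """True ratio p(x)/q(x) for p = N(0, std_p^2) and q = N(0, std_q^2)."""
    return gaussian_pdf(x, 0.0, std_p) / gaussian_pdf(x, 0.0, std_q)

xs = np.array([0.0, 1.0, 2.0, 3.0])
ratios = density_ratio(xs)

# With the default parameters the ratio has the closed form 0.5 * exp(1.5 * x^2),
# so it increases rapidly as x moves into the tail of q.
for x, r in zip(xs, ratios):
    print(f"x = {x:.1f}  p(x)/q(x) = {r:.4g}")
```

An estimator fitted to samples that include even one such tail point can be dominated by it, which is the robustness issue the passage describes.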
Artificial intelligence helps detect ovarian cancer early and accurately
Ovarian cancer is difficult to diagnose, particularly in its early stages, when survival rates are much higher. Because there is no consistently reliable screening test to detect ovarian cancer, most women are diagnosed with the disease when it's in an advanced stage. However, researchers at Brigham and Women's Hospital and Dana-Farber Cancer Institute have developed a non-invasive diagnostic test using artificial intelligence for the accurate detection of true cases of early-stage disease. Results of their study were published online this week in the journal eLife. By combining next generation sequencing with artificial intelligence, researchers have created a novel blood test based on serum microRNAs--small, non-coding pieces of genetic material that help control where and when genes are activated--for the early diagnosis of ovarian cancer.
Testing Machine Learning Algorithms with K-Fold Cross Validation - Talend
In an earlier post on Applying Machine Learning to IoT Sensors, I discussed the process for classifying sensor data with a machine learning algorithm. In this post, I'll give a background on choosing an algorithm, then using a validation technique. For the technique, I'll show how to apply it, and how it can be built using the Talend Studio without hand coding. Given a prediction scenario involving a machine learning algorithm, the first question to ask is what is the appropriate machine learning algorithm? Taking the example of predicting a user's activity based on mobile phone accelerometer data, we must be able to classify a category for the data (resting, walking, or running).
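For readers who want to see what k-fold cross validation does under the hood, the following is a minimal hand-coded sketch in Python. It is not the Talend Studio workflow the post goes on to describe. The synthetic "accelerometer" features, the nearest-centroid classifier, and all function names here are assumptions made purely for illustration; the three labels stand in for resting, walking, and running.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(perm, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def fit_centroids(X, y):
    """Toy classifier: remember the mean feature vector of each class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroids(model, X):
    """Assign each row of X to the class with the nearest centroid."""
    labels = np.array(list(model.keys()))
    centers = np.stack([model[label] for label in labels])
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return labels[dists.argmin(axis=1)]

def cross_validate(X, y, fit, predict, k=5):
    """Average accuracy over k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))

# Synthetic 3-axis data (assumed): 0 = resting, 1 = walking, 2 = running.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 3)) + 5.0 * y[:, None]

accuracy = cross_validate(X, y, fit_centroids, predict_centroids, k=5)
print(f"mean 5-fold accuracy: {accuracy:.3f}")
```

Passing fit and predict as arguments keeps the validation loop independent of the algorithm, so several candidate classifiers can be compared on the same folds.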
On Fairness and Calibration
Pleiss, Geoff, Raghavan, Manish, Wu, Felix, Kleinberg, Jon, Weinberger, Kilian Q.
The machine learning community has become increasingly concerned with the potential for bias and discrimination in predictive models. This has motivated a growing line of work on what it means for a classification procedure to be "fair." In this paper, we investigate the tension between minimizing error disparity across different population groups and maintaining calibrated probability estimates. We show that calibration is compatible only with a single error constraint (i.e., equal false-negative rates across groups), and show that any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier. These unsettling findings, which extend and generalize existing results, are empirically confirmed on several datasets.
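The two quantities in tension, per-group false-negative rate and per-group calibration, can be measured with a few lines of code. This is a hedged sketch only: the binned calibration gap, the 0.5 decision threshold, and the function names are assumptions for illustration and are not the definitions or the algorithm used in the paper.

```python
import numpy as np

def false_negative_rate(y_true, p, threshold=0.5):
    """Fraction of true positives predicted negative at the given threshold."""
    positives = y_true == 1
    if not positives.any():
        return float("nan")
    return float(np.mean(p[positives] < threshold))

def calibration_gap(y_true, p, n_bins=10):
    """Binned calibration error: weighted mean of |mean score - observed rate|."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    gap = 0.0
    for b in np.unique(bins):
        mask = bins == b
        gap += mask.mean() * abs(p[mask].mean() - y_true[mask].mean())
    return float(gap)

def per_group_report(y_true, p, group):
    """Return {group: (false-negative rate, calibration gap)}."""
    report = {}
    for g in np.unique(group):
        mask = group == g
        report[g.item()] = (
            false_negative_rate(y_true[mask], p[mask]),
            calibration_gap(y_true[mask], p[mask]),
        )
    return report

# Tiny hand-made example (assumed data): two groups, "a" and "b".
y_true = np.array([1, 1, 0, 1, 0, 1])
p = np.array([0.9, 0.1, 0.2, 0.3, 0.5, 0.5])
group = np.array(["a", "a", "a", "a", "b", "b"])

report = per_group_report(y_true, p, group)
for g, (fnr, gap) in report.items():
    print(f"group {g}: FNR = {fnr:.3f}, calibration gap = {gap:.3f}")
```

Comparing the first entries across groups shows the error disparity, while the second entries show how far each group's scores are from being calibrated.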
New blood test developed to diagnose ovarian cancer
Investigators from Brigham and Women's Hospital and Dana-Farber Cancer Institute are leveraging the power of artificial intelligence to develop a new technique to detect ovarian cancer early and accurately. The team has identified a network of circulating microRNAs - small, non-coding pieces of genetic material - that are associated with risk of ovarian cancer and can be detected from a blood sample. Their findings are published online in eLife. Most women are diagnosed with ovarian cancer when the disease is at an advanced stage, at which point only about a quarter of patients will survive for at least five years. But for women whose cancer is serendipitously picked up at an early stage, survival rates are much higher.
Union of Intersections (UoI) for Interpretable Data Driven Discovery and Prediction
Bouchard, Kristofer E., Bujan, Alejandro F., Roosta-Khorasani, Farbod, Ubaru, Shashanka, Prabhat, Snijders, Antoine M., Mao, Jian-Hua, Chang, Edward F., Mahoney, Michael W., Bhattacharyya, Sharmodeep
The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce Union of Intersections (UoI), a flexible, modular, and scalable framework for enhanced model selection and estimation. Methods based on UoI perform model selection and model estimation through intersection and union operations, respectively. We show that UoI-based methods achieve low-variance and nearly unbiased estimation of a small number of interpretable features, while maintaining high-quality prediction accuracy. We perform extensive numerical investigation to evaluate a UoI algorithm ($UoI_{Lasso}$) on synthetic and real data. In doing so, we demonstrate the extraction of interpretable functional networks from human electrophysiology recordings as well as accurate prediction of phenotypes from genotype-phenotype data with reduced features. We also show (with the $UoI_{L1Logistic}$ and $UoI_{CUR}$ variants of the basic framework) improved prediction parsimony for classification and matrix factorization on several benchmark biomedical data sets. These results suggest that methods based on the UoI framework could improve interpretation and prediction in data-driven discovery across scientific fields.