South America
NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation
Amgad, Mohamed, Atteya, Lamees A., Hussein, Hagar, Mohammed, Kareem Hosny, Hafiz, Ehab, Elsebaie, Maha A. T., Alhusseiny, Ahmed M., AlMoslemany, Mohamed Atef, Elmatboly, Abdelmagid M., Pappalardo, Philip A., Sakr, Rokia Adel, Mobadersany, Pooya, Rachid, Ahmad, Saad, Anas M., Alkashash, Ahmad M., Ruhban, Inas A., Alrefai, Anas, Elgazar, Nada M., Abdulkarim, Ali, Farag, Abo-Alela, Etman, Amira, Elsaeed, Ahmed G., Alagha, Yahya, Amer, Yomna A., Raslan, Ahmed M., Nadim, Menatalla K., Elsebaie, Mai A. T., Ayad, Ahmed, Hanna, Liza E., Gadallah, Ahmed, Elkady, Mohamed, Drumheller, Bradley, Jaye, David, Manthey, David, Gutman, David A., Elfandy, Habiba, Cooper, Lee A. D.
High-resolution mapping of cells and tissue structures provides a foundation for developing interpretable machine-learning models for computational pathology. Deep learning algorithms can provide accurate mappings given large numbers of labeled instances for training and validation. Generating adequate volume of quality labels has emerged as a critical barrier in computational pathology given the time and effort required from pathologists. In this paper we describe an approach for engaging crowds of medical students and pathologists that was used to produce a dataset of over 220,000 annotations of cell nuclei in breast cancers. We show how suggested annotations generated by a weak algorithm can improve the accuracy of annotations generated by non-experts and can yield useful data for training segmentation algorithms without laborious manual tracing. We systematically examine interrater agreement and describe modifications to the MaskRCNN model to improve cell mapping. We also describe a technique we call Decision Tree Approximation of Learned Embeddings (DTALE) that leverages nucleus segmentations and morphologic features to improve the transparency of nucleus classification models. The annotation data produced in this study are freely available for algorithm development and benchmarking at: https://sites.google.com/view/nucls .
A Latent Space Model for Multilayer Network Data
Sosa, Juan, Betancourt, Brenda
In this work, we propose a Bayesian statistical model to simultaneously characterize two or more social networks defined over a common set of actors. The key feature of the model is a hierarchical prior distribution that allows us to represent the entire system jointly, achieving a compromise between dependent and independent networks. Among others things, such a specification easily allows us to visualize multilayer network data in a low-dimensional Euclidean space, generate a weighted network that reflects the consensus affinity between actors, establish a measure of correlation between networks, assess cognitive judgements that subjects form about the relationships among actors, and perform clustering tasks at different social instances. Our model's capabilities are illustrated using several real-world data sets, taking into account different types of actors, sizes, and relations.
How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models
Alaa, Ahmed M., van Breugel, Boris, Saveliev, Evgeny, van der Schaar, Mihaela
Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data -- a crucial performance indicator when modeling sensitive data with requirements on privacy. The three metric components correspond to (interpretable) probabilistic quantities, and are estimated via sample-level binary classification. The sample-level nature of our metric inspires a novel use case which we call model auditing, wherein we judge the quality of individual samples generated by a (black-box) model, discarding low-quality samples and hence improving the overall model performance in a post-hoc manner.
FrugalMCT: Efficient Online ML API Selection for Multi-Label Classification Tasks
Chen, Lingjiao, Zaharia, Matei, Zou, James
Multi-label classification tasks such as OCR and multi-object recognition are a major focus of the growing machine learning as a service industry. While many multi-label prediction APIs are available, it is challenging for users to decide which API to use for their own data and budget, due to the heterogeneity in those APIs' price and performance. Recent work shows how to select from single-label prediction APIs. However the computation complexity of the previous approach is exponential in the number of labels and hence is not suitable for settings like OCR. In this work, we propose FrugalMCT, a principled framework that adaptively selects the APIs to use for different data in an online fashion while respecting user's budget. The API selection problem is cast as an integer linear program, which we show has a special structure that we leverage to develop an efficient online API selector with strong performance guarantees. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Tencent and other providers for tasks including multi-label image classification, scene text recognition and named entity recognition. Across diverse tasks, FrugalMCT can achieve over 90% cost reduction while matching the accuracy of the best single API, or up to 8% better accuracy while matching the best API's cost.
Deep Learning Approaches for Forecasting Strawberry Yields and Prices Using Satellite Images and Station-Based Soil Parameters
Chaudhary, Mohita, Gastli, Mohamed Sadok, Nassar, Lobna, Karray, Fakhri
Computational tools for forecasting yields and prices for fresh produce have been based on traditional machine learning approaches or time series modeling. We propose here an alternate approach based on deep learning algorithms for forecasting strawberry yields and prices in Santa Barbara county, California. Building the proposed forecasting model comprises three stages: first, the station-based ensemble model (ATT-CNN-LSTM-SeriesNet_Ens) with its compound deep learning components, SeriesNet with Gated Recurrent Unit (GRU) and Convolutional Neural Network LSTM with Attention layer (Att-CNN-LSTM), are trained and tested using the station-based soil temperature and moisture data of Santa Barbara as input and the corresponding strawberry yields or prices as output. Secondly, the remote sensing ensemble model (SIM_CNN-LSTM_Ens), which is an ensemble model of Convolutional Neural Network LSTM (CNN-LSTM) models, is trained and tested using satellite images of the same county as input mapped to the same yields and prices as output. These two ensembles forecast strawberry yields and prices with minimal forecasting errors and highest model correlation for five weeks ahead forecasts. Finally, the forecasts of these two models are ensembled to have a final forecasted value for yields and prices by introducing a voting ensemble. Based on an aggregated performance measure (AGM), it is found that this voting ensemble not only enhances the forecasting performance by 5% compared to its best performing component model but also outperforms the Deep Learning (DL) ensemble model found in literature by 33% for forecasting yields and 21% for forecasting prices.
Muddling Labels for Regularization, a novel approach to generalization
Lounici, Karim, Meziani, Katia, Riu, Benjamin
Generalization is a central problem in Machine Learning. Indeed most prediction methods require careful calibration of hyperparameters usually carried out on a hold-out \textit{validation} dataset to achieve generalization. The main goal of this paper is to introduce a novel approach to achieve generalization without any data splitting, which is based on a new risk measure which directly quantifies a model's tendency to overfit. To fully understand the intuition and advantages of this new approach, we illustrate it in the simple linear regression model ($Y=X\beta+\xi$) where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity,...). Noticeably, these procedures \textbf{concomitantly} train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO and Elastic-Net combined with grid-search cross-validation) in term of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of $\beta$. Moreover, our procedures do not require any expertise for the calibration of the initial parameters which remain the same for all the datasets we experimented on.
Optimizing Inference Performance of Transformers on CPUs
This paper comes to address this gap by presenting an empirical analysis of scalability and performance of inferencing Transfomerbased The Transformer architecture revolutionized the field of natural models on CPUs. We identify the key component of the language processing (NLP). Transformers-based models (e.g., BERT) Transformer architecture where the bulk of the computation happens, power many important Web services, such as search, translation, namely, the matrix multiplication (matmul) operations, and question-answering, etc. While enormous research attention is paid propose three optimizations to speed them up. to the training of those models, relatively little efforts are made The first optimization is based on the observation that the performance to improve their inference performance. This paper comes to address of the matmul operation is heavily impacted not only this gap by presenting an empirical analysis of scalability by the shape (dimensions) of the source matrices and the available and performance of inferencing a Transformer-based model on computing resources (the number of worker threads), but also by CPUs.
Spatio-Temporal Multi-step Prediction of Influenza Outbreaks
Zhang, Jie, Nawata, Kazumitsu, Wu, Hongyan
Flu circulates all over the world. The worldwide infection places a substantial burden on people's health every year. Regardless of the characteristic of the worldwide circulation of flu, most previous studies focused on regional prediction of flu outbreaks. The methodology of considering the spatio-temporal correlation could help forecast flu outbreaks more precisely. Furthermore, forecasting a long-term flu outbreak, and understanding flu infection trends more accurately could help hospitals, clinics, and pharmaceutical companies to better prepare for annual flu outbreaks. Predicting a sequence of values in the future, namely, the multi-step prediction of flu outbreaks should cause concern. Therefore, we highlight the importance of developing spatio-temporal methodologies to perform multi-step prediction of worldwide flu outbreaks. We compared the MAPEs of SVM, RF, LSTM models of predicting flu data of the 1-4 weeks ahead with and without other countries' flu data. We found the LSTM models achieved the lowest MAPEs in most cases. As for countries in the Southern hemisphere, the MAPEs of predicting flu data with other countries are higher than those of predicting without other countries. For countries in the Northern hemisphere, the MAPEs of predicting flu data of the 2-4 weeks ahead with other countries are lower than those of predicting without other countries; and the MAPEs of predicting flu data of the 1-weeks ahead with other countries are higher than those of predicting without other countries, except for the UK. In this study, we performed the spatio-temporal multi-step prediction of influenza outbreaks. The methodology considering the spatio-temporal features improves the multi-step prediction of flu outbreaks.
Boosting Deep Transfer Learning for COVID-19 Classification
Altaf, Fouzia, Islam, Syed M. S., Janjua, Naeem K., Akhtar, Naveed
COVID-19 classification using chest Computed Tomography (CT) has been found pragmatically useful by several studies. Due to the lack of annotated samples, these studies recommend transfer learning and explore the choices of pre-trained models and data augmentation. However, it is still unknown if there are better strategies than vanilla transfer learning for more accurate COVID-19 classification with limited CT data. This paper provides an affirmative answer, devising a novel `model' augmentation technique that allows a considerable performance boost to transfer learning for the task. Our method systematically reduces the distributional shift between the source and target domains and considers augmenting deep learning with complementary representation learning techniques. We establish the efficacy of our method with publicly available datasets and models, along with identifying contrasting observations in the previous studies.
A Thorough View of Exact Inference in Graphs from the Degree-4 Sum-of-Squares Hierarchy
Bello, Kevin, Ke, Chuyang, Honorio, Jean
Performing inference in graphs is a common task within several machine learning problems, e.g., image segmentation, community detection, among others. For a given undirected connected graph, we tackle the statistical problem of exactly recovering an unknown ground-truth binary labeling of the nodes from a single corrupted observation of each edge. Such problem can be formulated as a quadratic combinatorial optimization problem over the boolean hypercube, where it has been shown before that one can (with high probability and in polynomial time) exactly recover the ground-truth labeling of graphs that have an isoperimetric number that grows with respect to the number of nodes (e.g., complete graphs, regular expanders). In this work, we apply a powerful hierarchy of relaxations, known as the sum-of-squares (SoS) hierarchy, to the combinatorial problem. Motivated by empirical evidence on the improvement in exact recoverability, we center our attention on the degree-4 SoS relaxation and set out to understand the origin of such improvement from a graph theoretical perspective. We show that the solution of the dual of the relaxed problem is related to finding edge weights of the Johnson and Kneser graphs, where the weights fulfill the SoS constraints and intuitively allow the input graph to increase its algebraic connectivity. Finally, as byproduct of our analysis, we derive a novel Cheeger-type lower bound for the algebraic connectivity of graphs with signed edge weights.