Accuracy


Regularised Text Logistic Regression: Key Word Detection and Sentiment Classification for Online Reviews

arXiv.org Machine Learning

Online customer reviews have become important for managers and executives in the hospitality and catering industry who wish to obtain a comprehensive understanding of their customers' demands and expectations. We propose a Regularized Text Logistic (RTL) regression model to perform text analytics and sentiment classification on unstructured text data, which automatically identifies a set of statistically significant and operationally insightful word features, and achieves satisfactory predictive classification accuracy. We apply the RTL model to two online review datasets, Restaurant and Hotel, from TripAdvisor. Our results demonstrate satisfactory classification performance compared with alternative classifiers, with a highest true positive rate of 94.9%. Moreover, RTL identifies a small set of word features, corresponding to 3% for Restaurant and 20% for Hotel, which boosts working efficiency by allowing managers to drill down into a much smaller set of important customer reviews. We also establish the consistency, sparsity and oracle properties of the estimator.
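
The abstract does not spell out the RTL estimator, so the following is only a minimal sketch of the general idea of sparsity-inducing regularised logistic regression on bag-of-words features. It assumes a plain L1 (lasso) penalty fitted by proximal gradient descent, which may differ from the penalty the authors use. The four toy reviews, the vocabulary, and the names `fit_l1_logistic`, `lam`, and `lr` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, clip small values to 0.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, steps=500):
    """L1-penalised logistic regression via proximal gradient descent (assumed setup)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        p = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in X]
        grad_w = [sum((p[i] - y[i]) * X[i][j] for i in range(n)) / n for j in range(d)]
        grad_b = sum(p[i] - y[i] for i in range(n)) / n
        # Gradient step followed by soft-thresholding; the intercept is not penalised.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy reviews: label 1 = positive, 0 = negative.
reviews = [("the food great", 1), ("great the", 1),
           ("the food awful", 0), ("awful the", 0)]
vocab = ["great", "awful", "the", "food"]
X = [[1.0 if word in text.split() else 0.0 for word in vocab] for text, _ in reviews]
y = [label for _, label in reviews]

weights, bias = fit_l1_logistic(X, y)
# Words whose weight was not shrunk to exactly zero are the selected key words.
selected = [word for word, w in zip(vocab, weights) if w != 0.0]
print(selected)
```

In this toy setting the uninformative words ("the", "food") keep a weight of exactly zero, which illustrates how a sparse penalty can reduce the vocabulary to a small set of sentiment-bearing words.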


Massachusetts suspends Boston-based coronavirus testing lab Orig3n after nearly 400 false positives

Boston Herald

The state has suspended Boston-based COVID-19 testing lab Orig3n Laboratory after it produced nearly 400 false positive results. Public health officials became aware in early August of an "unusually high positivity rate" among the lab's test results and requested that Orig3n stop testing for the virus as of Aug. 8. Specimens were sent to an independent lab to be retested as part of a state Department of Public Health investigation, and the results showed at least 383 false positives. On Aug. 27, the state Department of Public Health notified Orig3n of "three significant certification deficiencies that put patients at immediate risk of harm," according to a DPH spokeswoman. They included the failure of the lab's director to provide overall management, issues with the extraction phase of testing, and a failure to meet analytic requirements such as documenting the daily sanitizing of equipment used for coronavirus testing. A statement of deficiency was issued on Sept. 4. The lab must now respond with a written plan of correction by Sept. 14, "and if action is not taken it can face sanctions," DPH said.


Highly Accurate CNN Inference Using Approximate Activation Functions over Homomorphic Encryption

arXiv.org Machine Learning

In the big data era, cloud-based machine learning as a service (MLaaS) has attracted considerable attention. However, when handling sensitive data, such as financial and medical data, a privacy issue emerges, because the cloud server can access clients' raw data. A common method of handling sensitive data in the cloud uses homomorphic encryption, which allows computation over encrypted data without decryption. Previous research usually adopted a low-degree polynomial mapping function, such as the square function, for data classification. However, this technique results in low classification accuracy. In this study, we seek to improve the classification accuracy for inference processing in a convolutional neural network (CNN) while using homomorphic encryption. We adopt an activation function that approximates Google's Swish activation function using a fourth-order polynomial. We also adopt batch normalization to normalize the inputs to the Swish function so that they fit the input range and the approximation error is minimized. We implemented CNN inference labeling over homomorphic encryption using Microsoft's Simple Encrypted Arithmetic Library for the Cheon-Kim-Kim-Song (CKKS) scheme. The experimental evaluations confirmed classification accuracies of 99.22% and 80.48% for MNIST and CIFAR-10, respectively, which correspond to 0.04% and 4.11% improvements, respectively, over previous methods.
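
To make the polynomial-approximation idea concrete, here is a rough sketch that fits a fourth-order polynomial to Swish, x * sigmoid(x), by ordinary least squares. The fitting interval [-4, 4], the sample count, and the least-squares criterion are assumptions chosen for illustration; the abstract does not state how the authors obtain their coefficients, and no encryption is involved here.

```python
import math

def swish(x):
    # Swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_poly(f, lo, hi, degree=4, samples=401):
    """Least-squares polynomial fit on an evenly spaced grid (normal equations)."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    A = [[sum(x ** (j + k) for x in xs) for k in range(degree + 1)]
         for j in range(degree + 1)]
    rhs = [sum(f(x) * x ** j for x in xs) for j in range(degree + 1)]
    return solve(A, rhs)  # coefficients c0..c4, lowest order first

def poly_eval(coeffs, x):
    # Horner's rule: only additions and multiplications, the operations
    # that homomorphic schemes such as CKKS support natively.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

coeffs = fit_poly(swish, -4.0, 4.0)  # interval is an assumption
max_err = max(abs(poly_eval(coeffs, k / 100) - swish(k / 100))
              for k in range(-400, 401))
print(coeffs, max_err)
```

The approximation is only meaningful inside the fitted interval, which is one reason normalising the activation inputs to a bounded range matters.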


A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments

arXiv.org Machine Learning

Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, often applied to data having a large number of features and/or complex associations. Currently, there are many tools to facilitate implementing custom ML analyses (e.g. scikit-learn). Interest is also increasing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML permeates most subfields of biomedical research with varying levels of rigor and correct usage. Tremendous opportunities offered by ML are frequently offset by the challenge of assembling comprehensive analysis pipelines, and the ease of ML misuse. In this work we have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification (i.e. case/control prediction), and applied this pipeline to both simulated and real world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. This pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices to avoid bias and ensure reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association. While designed to be widely applicable we apply this pipeline to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.


Quantifying Explainability of Saliency Methods in Deep Neural Networks

arXiv.org Artificial Intelligence

One way to achieve eXplainable artificial intelligence (XAI) is through the use of post-hoc analysis methods. In particular, methods that generate heatmaps have been used to explain black-box models, such as deep neural networks. In some cases, heatmaps are appealing because they are intuitive and visual. However, quantitative analysis that demonstrates the actual potential of heatmaps has been lacking, and comparisons between different methods are not standardized either. In this paper, we introduce a synthetic dataset that can be generated ad hoc along with ground-truth heatmaps for better quantitative assessment. Each sample is an image of a cell with easily distinguishable features, facilitating a more transparent assessment of different XAI methods. Comparisons and recommendations are made, and shortcomings are clarified along with suggestions for future research directions to handle the finer details of select post-hoc analysis methods.
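
As an illustration of what a quantitative comparison against a ground-truth heatmap could look like, the sketch below thresholds a saliency map and computes its intersection-over-union (IoU) with a binary ground-truth mask. IoU and the 0.5 threshold are assumptions for the sake of the example; the abstract does not say which metrics the paper uses, and the tiny arrays are made up.

```python
def heatmap_iou(heatmap, truth_mask, threshold=0.5):
    """IoU between the thresholded heatmap and a binary ground-truth mask."""
    intersection = 0
    union = 0
    for heat_row, truth_row in zip(heatmap, truth_mask):
        for heat, truth in zip(heat_row, truth_row):
            predicted = heat >= threshold
            actual = truth == 1
            if predicted and actual:
                intersection += 1
            if predicted or actual:
                union += 1
    # If neither map marks any pixel, treat the agreement as perfect.
    return intersection / union if union else 1.0

# Hypothetical 3x3 saliency map and ground-truth mask.
heatmap = [[0.9, 0.8, 0.1],
           [0.2, 0.7, 0.0],
           [0.0, 0.1, 0.0]]
truth = [[1, 1, 0],
         [0, 0, 0],
         [0, 0, 0]]

iou = heatmap_iou(heatmap, truth)
print(iou)
```

Because the synthetic images come with known ground truth, a score of this kind can be computed for every method on the same samples, which is what makes the comparison standardizable.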


High-throughput relation extraction algorithm development associating knowledge articles and electronic health records

arXiv.org Machine Learning

Objective: Medical relations are the core components of medical knowledge graphs that are needed for healthcare artificial intelligence. However, the requirement of expert annotation by conventional algorithm development processes creates a major bottleneck for mining new relations. In this paper, we present Hi-RES, a framework for high-throughput relation extraction algorithm development. We also show that combining knowledge articles with electronic health records (EHRs) significantly increases the classification accuracy. Methods: We use relation triplets obtained from structured databases and semistructured webpages to label sentences from target corpora as positive training samples. Two methods are also provided for creating improved negative samples by combining positive samples with naïve negative samples. We propose a common model that summarizes sentence information using large-scale pretrained language models and multi-instance attention, which then joins with the concept embeddings trained from the EHRs for relation prediction. Results: We apply the Hi-RES framework to develop classification algorithms for disorder-disorder relations and disorder-location relations. Millions of sentences are created as training data. Using pretrained language models and EHR-based embeddings individually provides considerable accuracy increases over those of previous models. Joining them together further increases the accuracy to 0.947 and 0.998 for the two sets of relations, respectively, which are 10-17 percentage points higher than those of previous models. Conclusion: Hi-RES is an efficient framework for achieving high-throughput and accurate relation extraction algorithm development.
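
The positive-sample labelling step described under Methods resembles distant supervision, and can be sketched roughly as follows. This is not the Hi-RES implementation: it assumes simple case-insensitive substring matching of entity names, and the triplet, the relation name `associated_with`, and the sentences are hypothetical.

```python
def label_sentences(triplets, sentences):
    """Mark a sentence as a positive sample for a triplet when it mentions
    both the head and the tail entity (naive substring matching)."""
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for head, relation, tail in triplets:
            if head.lower() in lowered and tail.lower() in lowered:
                labeled.append((sentence, head, relation, tail))
    return labeled

# Hypothetical triplet and corpus, for illustration only.
triplets = [("diabetes", "associated_with", "retinopathy")]
sentences = [
    "Diabetes is frequently discussed together with retinopathy.",
    "Retinopathy screening schedules vary between clinics.",
]

positives = label_sentences(triplets, sentences)
for sentence, head, relation, tail in positives:
    print(f"({head}, {relation}, {tail}) <- {sentence}")
```

Labels produced this way are noisy, since a sentence can mention both entities without expressing the relation, which is one motivation for aggregating over many sentences with multi-instance attention.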


How to Evaluate the Performance of Your Machine Learning Model - KDnuggets

#artificialintelligence

Let me start with a very simple example. Robin and Sam both started preparing for an entrance exam for engineering college. They shared a room and put in an equal amount of hard work solving numerical problems. They both studied almost the same hours for the entire year and appeared in the final exam. Surprisingly, Robin cleared, but Sam did not.


Classifier Combination Approach for Question Classification for Bengali Question Answering System

arXiv.org Artificial Intelligence

Question classification (QC) is a prime constituent of an automated question answering system. The work presented here demonstrates that a combination of multiple models achieves better classification performance than existing individual models for the question classification task in Bengali. We have exploited state-of-the-art multiple model combination techniques, i.e., ensemble, stacking and voting, to increase QC accuracy. Lexical, syntactic and semantic features of Bengali questions are used for four well-known classifiers, namely Naïve Bayes, kernel Naïve Bayes, Rule Induction, and Decision Tree, which serve as our base learners. A single-layer question-class taxonomy with 8 coarse-grained classes is extended to a two-layer taxonomy by adding 69 fine-grained classes. We carried out the experiments on both the single-layer and two-layer taxonomies. Experimental results confirmed that classifier combination approaches outperform single-classifier approaches by 4.02% for coarse-grained question classes. Overall, the stacking approach produces the best results for fine-grained classification and achieves 87.79% accuracy. The approach presented here could be used in other Indo-Aryan or Indic languages to develop a question answering system. Both theoretical [22, 23] and empirical [24-26] studies confirm that the classifier combination approach is generally more accurate than any of the individual classifiers making up the ensemble. Furthermore, a number of studies [27, 30] were successfully carried out on classifier combination methods for the QC task which outperformed the individual classifiers.
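
Of the combination techniques named above, voting is the simplest to sketch. The example below applies plain majority voting to the outputs of three base learners; the class labels and the per-classifier predictions are made up, and the tie-breaking rule (the label seen first wins) is an assumption rather than something taken from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority vote.

    Ties are broken in favour of the label that appears first among the
    classifiers, which follows from Counter's insertion ordering.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

# Hypothetical predictions for four questions from three base learners.
naive_bayes    = ["PERSON",   "LOCATION", "NUMBER", "LOCATION"]
decision_tree  = ["PERSON",   "NUMBER",   "NUMBER", "LOCATION"]
rule_induction = ["LOCATION", "LOCATION", "NUMBER", "PERSON"]

combined = majority_vote([naive_bayes, decision_tree, rule_induction])
print(combined)
```

Stacking differs in that a second-level learner is trained on the base learners' outputs instead of applying a fixed rule like this one.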


Fairness-Aware Online Personalization

arXiv.org Artificial Intelligence

Decision making in crucial applications such as lending, hiring, and college admissions has witnessed increasing use of algorithmic models and techniques as a result of a confluence of factors such as ubiquitous connectivity, ability to collect, aggregate, and process large amounts of fine-grained data using cloud computing, and ease of access to applying sophisticated machine learning models. Quite often, such applications are powered by search and recommendation systems, which in turn make use of personalized ranking algorithms. At the same time, there is increasing awareness about the ethical and legal challenges posed by the use of such data-driven systems. Researchers and practitioners from different disciplines have recently highlighted the potential for such systems to discriminate against certain population groups, due to biases in the datasets utilized for learning their underlying recommendation models. We present a study of fairness in online personalization settings involving the ranking of individuals. Starting from a fair warm-start machine-learned model, we first demonstrate that online personalization can cause the model to learn to act in an unfair manner if the user is biased in his/her responses. For this purpose, we construct a stylized model for generating training data with potentially biased features as well as potentially biased labels and quantify the extent of bias that is learned by the model when the user responds in a biased manner as in many real-world scenarios. We then formulate the problem of learning personalized models under fairness constraints and present a regularization based approach for mitigating biases in machine learning. We demonstrate the efficacy of our approach through extensive simulations with different parameter settings. Code: https://github.com/groshanlal/Fairness-Aware-Online-Personalization


Automatic Yara Rule Generation Using Biclustering

arXiv.org Machine Learning

Yara rules are a ubiquitous tool among cybersecurity practitioners and analysts. Developing high-quality Yara rules to detect a malware family of interest can be labor- and time-intensive, even for expert users. Few tools exist and relatively little work has been done on how to automate the generation of Yara rules for specific families. In this paper, we leverage large n-grams ($n \geq 8$) combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software. Our method, AutoYara, is fast, allowing for deployment on low-resource equipment for teams that deploy to remote networks. Our results demonstrate that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates, sometimes matching or even outperforming human analysts. In addition, real-world testing by malware analysts indicates AutoYara could reduce analyst time spent constructing Yara rules by 44-86%, allowing them to spend their time on the more advanced malware that current tools can't handle. Code will be made available at https://github.com/NeuromorphicComputationResearchProgram.
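
The sketch below shows only the n-gram side of the idea: it extracts byte 8-grams from a few samples and keeps those that occur in at least two of them, as candidate strings for a rule. It does not implement the paper's biclustering algorithm or AutoYara itself, and the sample byte strings, the function names, and the `min_samples` threshold are hypothetical.

```python
from collections import Counter

def byte_ngrams(data, n=8):
    """Set of all distinct n-byte substrings of a sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def shared_ngrams(samples, n=8, min_samples=2):
    """N-grams that occur in at least `min_samples` different samples."""
    counts = Counter()
    for sample in samples:
        # Each sample contributes at most once per n-gram.
        counts.update(byte_ngrams(sample, n))
    return {gram for gram, count in counts.items() if count >= min_samples}

# Hypothetical samples: two share an embedded marker, one is unrelated.
samples = [
    b"AAAA_payload_v1_XXXX",
    b"BBBB_payload_v1_YYYY",
    b"unrelated bytes here",
]

candidates = shared_ngrams(samples)
for gram in sorted(candidates):
    print(gram)
```

A real generator would also need to check candidates against benign files to keep the false-positive rate low, a step this sketch leaves out.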