 Performance Analysis


Clustering and Learning from Imbalanced Data

arXiv.org Machine Learning

A learning classifier must outperform a trivial solution, but with imbalanced data this condition usually does not hold. To overcome this problem, we propose a novel data-level resampling method, Clustering Based Oversampling, for improved learning from class-imbalanced datasets. The essential idea behind the proposed method is to use the distance between a minority class sample and its respective cluster centroid to infer the number of new sample points to be generated for that minority class sample. The proposed algorithm depends very little on the technique used for finding cluster centroids and does not affect majority class learning in any way. It also improves learning from imbalanced data by incorporating the distribution structure of minority class samples into the generation of new data samples. The newly generated minority class data is handled so as to prevent outlier production and overfitting. Implementation analysis on different datasets using deep neural networks as the learning classifier shows the effectiveness of this method compared to other synthetic data resampling techniques across several evaluation metrics.
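
The abstract gives only the core idea, so the following is a minimal sketch under stated assumptions rather than the authors' algorithm. It assumes a single cluster whose centroid is the mean of the minority samples, assumes that samples farther from the centroid receive more synthetic points (the abstract does not state the direction), and assumes new points are drawn on the segment between a sample and the centroid so that they stay inside the minority region. The names `centroid`, `allocate_counts`, and `oversample` are hypothetical.

```python
import math
import random


def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def allocate_counts(minority, center, n_new):
    """Split n_new synthetic points across the minority samples.

    Assumption: the share of each sample is proportional to its
    distance from the centroid.
    """
    dists = [math.dist(p, center) for p in minority]
    total = sum(dists)
    if total == 0:
        # Every sample sits on the centroid: spread the points evenly.
        base, extra = divmod(n_new, len(minority))
        return [base + (1 if i < extra else 0) for i in range(len(minority))]
    raw = [n_new * d / total for d in dists]
    counts = [int(r) for r in raw]
    # Give the leftover points to the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def oversample(minority, n_new, seed=0):
    """Generate n_new synthetic minority points (illustrative only)."""
    rng = random.Random(seed)
    center = centroid(minority)
    counts = allocate_counts(minority, center, n_new)
    synthetic = []
    for point, k in zip(minority, counts):
        for _ in range(k):
            t = rng.random()  # 0 keeps the sample, values near 1 approach the centroid
            synthetic.append([a + t * (c - a) for a, c in zip(point, center)])
    return synthetic


minority = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
new_points = oversample(minority, n_new=10)
```

Because every synthetic point lies between an existing sample and the centroid, none can fall outside the bounding box of the original minority samples, which is one simple way to avoid producing outliers.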


Learning From Positive and Unlabeled Data: A Survey

arXiv.org Machine Learning

Learning from positive and unlabeled data, or PU learning, is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article provides a survey of the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.


Detection of REM Sleep Behaviour Disorder by Automated Polysomnography Analysis

arXiv.org Machine Learning

Evidence suggests Rapid-Eye-Movement (REM) Sleep Behaviour Disorder (RBD) is an early predictor of Parkinson's disease. This study proposes a fully-automated framework for RBD detection consisting of automated sleep staging followed by RBD identification. The framework was assessed using a limited polysomnography montage from 53 participants with RBD and 53 age-matched healthy controls. Sleep stage classification was achieved using a Random Forest (RF) classifier and 156 features extracted from electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG) channels. For RBD detection, an RF classifier was trained combining established techniques to quantify muscle atonia with additional features that incorporate sleep architecture and the EMG fractal exponent. Automated multi-state sleep staging achieved a Cohen's Kappa score of 0.62. RBD detection accuracy improved by 10% to 96% (compared to individual established metrics) when using manually annotated sleep staging. Accuracy remained high (92%) when using automated sleep staging. This study outperforms established metrics and demonstrates that incorporating sleep architecture and sleep stage transitions can benefit RBD detection. This study also achieved automated sleep staging with a level of accuracy comparable to manual annotation. This study validates a tractable, fully-automated, and sensitive pipeline for RBD identification that could be translated to wearable take-home technology.
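
For readers unfamiliar with the staging metric, Cohen's Kappa measures agreement between two label sequences after correcting for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. The sketch below is a generic illustration of that standard formula, not code from the study; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: hypothetical manual vs. automated labels for four epochs.
manual = ["REM", "REM", "N2", "N2"]
automated = ["REM", "REM", "N2", "REM"]
kappa = cohens_kappa(manual, automated)  # observed 0.75, expected 0.5, kappa 0.5
```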


Machine Learning with Abstention for Automated Liver Disease Diagnosis

arXiv.org Machine Learning

This paper presents a novel approach for detection of liver abnormalities in an automated manner using ultrasound images. For this purpose, we have implemented a machine learning model that can not only generate labels (normal and abnormal) for a given ultrasound image but also detect when its prediction is likely to be incorrect. The proposed model abstains from generating the label of a test example if it is not confident about its prediction. Such behavior is common among medical doctors who, when given insufficient information or a difficult case, can choose to carry out further clinical or diagnostic tests before making a diagnosis. However, existing machine learning models are designed to always generate a label for a given example, even when the confidence of their prediction is low. We propose a novel stochastic gradient based solver for the learning with abstention paradigm and use it to build a practical, state of the art method for liver disease classification. The proposed method has been benchmarked on a data set of approximately 100 patients from MINAR, Multan, Pakistan, and our results show that the proposed scheme offers state of the art classification performance.
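
The abstention behavior can be illustrated with a simple confidence threshold. This is a sketch of a generic reject-option rule, not the stochastic gradient based solver the paper proposes; it assumes the model already outputs a probability that an image is abnormal, and the function names and the 0.8 threshold are hypothetical.

```python
def predict_with_abstention(p_abnormal, threshold=0.8):
    """Return 'abnormal', 'normal', or None when the model should abstain."""
    if not 0.0 <= p_abnormal <= 1.0:
        raise ValueError("p_abnormal must be a probability in [0, 1]")
    # Confidence is the probability of the more likely class.
    confidence = max(p_abnormal, 1.0 - p_abnormal)
    if confidence < threshold:
        return None  # defer the case, e.g. to further diagnostic tests
    return "abnormal" if p_abnormal >= 0.5 else "normal"


def selective_accuracy(probs, truths, threshold=0.8):
    """Accuracy on accepted cases, plus coverage (fraction of cases accepted)."""
    decisions = [predict_with_abstention(p, threshold) for p in probs]
    accepted = [(d, t) for d, t in zip(decisions, truths) if d is not None]
    coverage = len(accepted) / len(probs)
    if not accepted:
        return None, coverage
    accuracy = sum(d == t for d, t in accepted) / len(accepted)
    return accuracy, coverage


# Hypothetical model outputs and ground-truth labels for four images.
probs = [0.95, 0.05, 0.60, 0.90]
truths = ["abnormal", "normal", "normal", "normal"]
accuracy, coverage = selective_accuracy(probs, truths)
```

Raising the threshold trades coverage for accuracy: fewer cases receive a label, but the labels that are issued tend to be more reliable.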


Why Is Data Science Different than Software Development? It Starts with Data…Lots o' DATA!!

#artificialintelligence

Data science development is very different from software development, and getting the two to mesh is sometimes like trying to cobble together Tinker Toys with Lincoln Logs. Software development is "Measure twice; cut once," while Data Science is "Cut, cut, cut!" The methodologies and processes that support successful software development do not work for data science projects because of one simple observation: software development knows, with 100% assurance, the expected outcomes, while data science – through data exploration and hypothesis testing, failing and learning – discovers those outcomes. First introduced in the blog "What's The Difference Between BI Analyst and Data Scientist?", the Data Science Engagement methodology in Figure 1 supports the rapid exploration, rapid testing, and continuous learning Data Science "Scientific Method[1]". Let's review each of these in more detail.


Predicting Adverse Media Risk using a Heterogeneous Information Network

arXiv.org Machine Learning

The media plays a central role in monitoring powerful institutions and identifying any activities harmful to the public interest. In the investing sphere, which comprises 46,583 officially listed domestic firms on stock exchanges worldwide, there is a growing interest in 'doing the right thing', i.e., putting pressure on companies to improve their environmental, social and governance (ESG) practices. However, how can the sparsity of ESG data from non-reporting firms be overcome, and how can the relevant information be identified in the annual reports of this large universe? Here, we construct a vast heterogeneous information network that covers the necessary information surrounding each firm, which is assembled using seven professionally curated datasets and two open datasets, resulting in about 50 million nodes and 400 million edges in total. Exploiting this heterogeneous information network, we propose a model that can learn from past adverse media coverage patterns and predict the occurrence of future adverse media coverage events on the whole universe of firms. Our approach is tested using the adverse media coverage data of more than 35,000 firms worldwide from January 2012 to May 2018. Comparing with state-of-the-art methods with and without the network, we show that the predictive accuracy is substantially improved when using the heterogeneous information network. This work suggests new ways to consolidate the diffuse information contained in big data in order to monitor dominant institutions on a global scale for more socially responsible investment, better risk management, and the surveillance of powerful institutions.


Design Rule Violation Hotspot Prediction Based on Neural Network Ensembles

arXiv.org Machine Learning

Design rule check is a critical step in the physical design of integrated circuits to ensure manufacturability. However, it can be done only after a time-consuming detailed routing procedure, which adds drastically to the time of design iterations. With advanced technology nodes, the outcomes of global routing and detailed routing become less correlated, which adds to the difficulty of predicting design rule violations from earlier stages. In this paper, a framework based on neural network ensembles is proposed to predict design rule violation hotspots using information from placement and global routing. A soft voting structure and a PCA-based subset selection scheme are developed on top of a baseline neural network from a recent work. Experimental results show that the proposed architecture achieves significant improvement in model performance compared to the baseline case. For half of test cases, the performance is even better than random forest, a commonly-used ensemble learning model. Today's IC fabrication technologies require satisfying many complex design rules to ensure manufacturability.
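
Soft voting, mentioned above, combines ensemble members by averaging their predicted class probabilities instead of counting their hard votes. The following is a generic sketch of that idea and makes no claim about the paper's architecture or its PCA-based subset selection; the function names and the example probabilities are hypothetical.

```python
def soft_vote(member_probs):
    """Average per-class probabilities across ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    if any(len(p) != n_classes for p in member_probs):
        raise ValueError("every member must report the same number of classes")
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


def predict(member_probs):
    """Index of the class with the highest averaged probability."""
    averaged = soft_vote(member_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Three hypothetical networks scoring one region as [no hotspot, hotspot].
members = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
averaged = soft_vote(members)
label = predict(members)
```

In this example two of the three members lean towards class 1, so a hard majority vote would pick class 1, while soft voting picks class 0 because the first member is far more confident than the other two.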


How Do Fairness Definitions Fare? Examining Public Attitudes Towards Algorithmic Definitions of Fairness

arXiv.org Artificial Intelligence

What is the best way to define algorithmic fairness? There has been much recent debate on algorithmic fairness. While many definitions of fairness have been proposed in the computer science literature, there is no clear agreement over a particular definition. In this work, we investigate ordinary people's perceptions of three of these fairness definitions. Across two online experiments, we test which definitions people perceive to be the fairest in the context of loan decisions, and whether those fairness perceptions change with the addition of sensitive information (i.e., race of the loan applicants). We find a clear preference for one definition, and the general results seem to align with the principle of affirmative action.


Explainable cardiac pathology classification on cine MRI with motion characterization by semi-supervised learning of apparent flow

arXiv.org Artificial Intelligence

We propose a method to classify cardiac pathology based on a novel approach to extract image derived features to characterize the shape and motion of the heart. An original semi-supervised learning procedure, which makes efficient use of a large amount of non-segmented images and a small amount of images segmented manually by experts, is developed to generate pixel-wise apparent flow between two time points of a 2D+t cine MRI image sequence. Combining the apparent flow maps and cardiac segmentation masks, we obtain a local apparent flow corresponding to the 2D motion of myocardium and ventricular cavities. This leads to the generation of time series of the radius and thickness of myocardial segments to represent cardiac motion. These time series of motion features are reliable and explainable characteristics of pathological cardiac motion. Furthermore, they are combined with shape-related features to classify cardiac pathologies. Using only nine feature values as input, we propose an explainable, simple and flexible model for pathology classification. On the ACDC training and testing sets, the model achieves classification accuracies of 95% and 94%, respectively. Its performance is hence comparable to that of the state-of-the-art. Comparison with various other models is performed to outline some advantages of our model.


Blood test can spot DNA from eight different types of cancer

New Scientist

A simple blood test can detect eight different types of cancer. It does this by detecting the various sizes of tumour DNA fragments that flow through the body. At the moment, most cancer screening tools are limited to specific areas of the body – for example, mammograms for spotting breast cancer and faecal tests for detecting bowel cancer. Whole-body MRI and CT scans can identify tumours throughout the body, but only once they have grown large enough to see. As a result, many research groups are working on developing blood tests that can detect multiple different cancer types while they are still in early, treatable stages.