Improving Specificity in Mammography Using Cross-correlation between Wavelet and Fourier Transform

arXiv.org Artificial Intelligence

Breast cancer is the most common malignant tumor in women, accounting for 30% of new malignant tumor cases. Although the incidence of breast cancer remains high around the world, the mortality rate has been continuously reduced, mainly due to recent developments in molecular biology technology and an improved level of comprehensive diagnosis and standard treatment. Early detection by mammography is an integral part of that. The most common breast abnormalities that may indicate breast cancer are masses and calcifications. Previous detection approaches usually achieve relatively high sensitivity but unsatisfactory specificity. We will investigate an approach that applies the discrete wavelet transform and Fourier transform to parse the images and extracts statistical features that characterize an image's content, such as the mean intensity and the skewness of the intensity. A naive Bayesian classifier uses these features to classify the images. We expect this approach to achieve high specificity.
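
Abstracting from the paper's description, the pipeline can be read as: decompose each mammogram with a wavelet and a Fourier transform, summarise the resulting coefficient bands with simple statistics (mean, spread, skewness), and feed those features to a naive Bayes classifier. The following is a minimal sketch of that reading, assuming PyWavelets, NumPy, SciPy and scikit-learn; the wavelet family, the single decomposition level and the exact feature list are illustrative choices, not taken from the paper.

```python
import numpy as np
import pywt
from scipy.stats import skew
from sklearn.naive_bayes import GaussianNB

def extract_features(image: np.ndarray) -> np.ndarray:
    """Statistical features from one wavelet level and the Fourier magnitude spectrum."""
    cA, (cH, cV, cD) = pywt.dwt2(image, "db2")   # 2-D discrete wavelet transform (assumed basis)
    spectrum = np.abs(np.fft.fft2(image))        # Fourier magnitude spectrum
    feats = []
    for band in (cA, cH, cV, cD, spectrum):
        vals = band.ravel()
        feats.extend([vals.mean(), vals.std(), skew(vals)])
    return np.asarray(feats)

# images: list of 2-D grayscale mammogram arrays; labels: 0 = normal, 1 = suspicious
# X = np.stack([extract_features(img) for img in images])
# clf = GaussianNB().fit(X, labels)
```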


Beyond Visual Image: Automated Diagnosis of Pigmented Skin Lesions Combining Clinical Image Features with Patient Data

arXiv.org Artificial Intelligence

Among the most common types of skin cancer are basal cell carcinoma, squamous cell carcinoma and melanoma. According to the WHO (2018), between 2 and 3 million non-melanoma skin cancers and 132,000 melanoma skin cancers currently occur every year worldwide. Melanoma is by far the most dangerous form of skin cancer, causing more than 75% of all skin cancer deaths (Allen, 2016). Early diagnosis of the disease plays an important role in reducing the mortality rate, with a chance of cure greater than 90% (SBD, 2018). The diagnosis of pigmented skin lesions (PSLs) can be made by invasive and non-invasive methods. One of the most common non-invasive methods was presented by Soyer et al. (1987). The method allows the visualization of morphological structures not visible to the naked eye with the use of an instrument called a dermatoscope. When compared to clinical diagnosis, the use of the dermatoscope by experts makes the diagnosis of PSLs easier, increasing diagnostic sensitivity by 10-27% (Mayer et al., 1997).


Robust Wavelet-based Assessment of Scaling with Applications

arXiv.org Machine Learning

A number of approaches have dealt with statistical assessment of self-similarity, and many of those are based on multiscale concepts. Most rely on certain distributional assumptions which are usually violated by real data traces, often characterized by large temporal or spatial mean level shifts, missing values or extreme observations. A novel, robust approach based on Theil-type weighted regression is proposed for estimating self-similarity in two-dimensional data (images). The method is compared to two traditional estimation techniques that use wavelet decompositions: ordinary least squares (OLS) and the Abry-Veitch bias-correcting estimator (AV). As an application, the suitability of the self-similarity estimate resulting from the robust approach is illustrated as a predictive feature in the classification of digitized mammogram images as cancerous or non-cancerous. The diagnostic employed here is based on the properties of image backgrounds, which are a typically unused modality in breast cancer screening. Classification results show nearly 68% accuracy, varying slightly with the choice of wavelet basis and the range of multiresolution levels used.
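
The estimation idea behind wavelet-based scaling assessment is a log-linear regression of detail-coefficient energies on the decomposition level; the paper's contribution is replacing OLS in that regression with a robust Theil-type weighted fit. Below is a minimal sketch, assuming PyWavelets and NumPy, that contrasts an OLS slope with a simple Theil-Sen-style median of pairwise slopes as a stand-in for the paper's weighted Theil-type estimator; the wavelet basis, the number of levels and the slope-to-Hurst mapping are assumptions.

```python
from itertools import combinations

import numpy as np
import pywt

def scaling_slopes(image: np.ndarray, wavelet: str = "db2", levels: int = 5):
    """Return (OLS slope, robust median-of-pairwise slope) of log2 detail energy vs. level."""
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    energies = []
    # coeffs[1:] holds the (cH, cV, cD) detail tuples from coarsest to finest level
    for k, details in enumerate(coeffs[1:]):
        level = levels - k
        d = np.concatenate([band.ravel() for band in details])
        energies.append((level, np.log2(np.mean(d ** 2))))
    j = np.array([lv for lv, _ in energies], dtype=float)
    y = np.array([e for _, e in energies])
    ols_slope = np.polyfit(j, y, 1)[0]
    pairwise = [(y[b] - y[a]) / (j[b] - j[a]) for a, b in combinations(range(len(j)), 2)]
    robust_slope = float(np.median(pairwise))
    # For a 2-D fractional Brownian field the slope is commonly read as 2H + 2;
    # the exact mapping to a Hurst exponent depends on the assumed field model.
    return ols_slope, robust_slope
```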


Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

arXiv.org Artificial Intelligence

In many application domains such as medicine, information retrieval, cybersecurity and social media, datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable even though the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to intrinsic data characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for the automatic selection of the best resampling strategy for any dataset based on its characteristics. These models make it possible to assess several factors simultaneously over a wide range of values, since they are induced from highly varied datasets that cover a broad spectrum of conditions. This differs from most studies, which focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special-purpose techniques that would be valid only for the target data.
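
As a concrete illustration of the basic resampling families compared in the study, the sketch below balances a training set by SMOTE oversampling and by random undersampling and measures the effect on a downstream classifier; it assumes imbalanced-learn and scikit-learn, and the base classifier and metric are illustrative choices. The paper's meta-models for selecting a strategy from dataset characteristics are not shown.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def compare_resamplers(X, y, seed=0):
    """Balanced accuracy of one classifier under no resampling, SMOTE and undersampling."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=seed)
    samplers = {"none": None,
                "smote": SMOTE(random_state=seed),
                "undersample": RandomUnderSampler(random_state=seed)}
    results = {}
    for name, sampler in samplers.items():
        Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
        clf = RandomForestClassifier(random_state=seed).fit(Xr, yr)
        results[name] = balanced_accuracy_score(y_te, clf.predict(X_te))
    return results
```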


Classification of high-dimensional data with spiked covariance matrix structure

arXiv.org Machine Learning

We study the classification problem for high-dimensional data with $n$ observations on $p$ features where the $p \times p$ covariance matrix $\Sigma$ exhibits a spiked eigenvalue structure and the vector $\zeta$, given by the difference between the whitened mean vectors, is sparse with sparsity at most $s$. We propose an adaptive classifier (adaptive with respect to the sparsity $s$) that first performs dimension reduction on the feature vectors prior to classification in the dimensionally reduced space: the classifier whitens the data, then screens the features by keeping only those corresponding to the $s$ largest coordinates of $\zeta$, and finally applies Fisher's linear discriminant on the selected features. Leveraging recent results on entrywise matrix perturbation bounds for covariance matrices, we show that the resulting classifier is Bayes optimal whenever $n \rightarrow \infty$ and $s \sqrt{n^{-1} \ln p} \rightarrow 0$. Experimental results on real and synthetic data sets indicate that the proposed classifier is competitive with existing state-of-the-art methods while also selecting a smaller number of features.
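
The classifier has three steps that can be sketched directly: whiten the data, keep the $s$ coordinates where the whitened mean difference $\zeta$ is largest in magnitude, and apply Fisher's rule on those coordinates (after whitening, the discriminant direction is just the screened $\zeta$). A minimal NumPy sketch follows; the covariance regularisation, the choice of $s$ and the equal-prior midpoint threshold are assumptions.

```python
import numpy as np

def fit_screened_lda(X, y, s, eps=1e-6):
    """Whiten, screen the s largest coordinates of the whitened mean difference, apply Fisher's rule."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # Regularised pooled within-class covariance
    Sigma = ((n0 - 1) * np.cov(X0, rowvar=False) + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    Sigma += eps * np.eye(X.shape[1])
    evals, evecs = np.linalg.eigh(Sigma)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T                 # whitening transform
    zeta = W @ (X1.mean(0) - X0.mean(0))                         # whitened mean difference
    keep = np.argsort(np.abs(zeta))[-s:]                         # screening step
    w = np.zeros(X.shape[1])
    w[keep] = zeta[keep]
    direction = W @ w                                            # discriminant direction in original space
    threshold = direction @ (X0.mean(0) + X1.mean(0)) / 2        # midpoint rule (equal priors assumed)
    return direction, threshold

# predict class 1 when x @ direction > threshold, class 0 otherwise
```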


Towards Personalized and Human-in-the-Loop Document Summarization

arXiv.org Artificial Intelligence

The ubiquitous availability of computing devices and the widespread use of the internet have continuously generated large amounts of data. As a result, the amount of available information on any given topic is far beyond humans' capacity to process it properly, causing what is known as information overload. To cope efficiently with large amounts of information and generate content with significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. The thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by: i) enabling automatic intelligent feature engineering, ii) enabling flexible and interactive summarisation, and iii) utilising intelligent and personalised summarisation approaches. The experimental results demonstrate the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data.


Ensemble Learning Based Classification Algorithm Recommendation

arXiv.org Artificial Intelligence

Recommending appropriate algorithms for a classification problem is one of the most challenging issues in the field of data mining. Existing algorithm recommendation models are generally constructed on only one kind of meta-features by single learners. Considering that i) ensemble learners usually show better performance and ii) different kinds of meta-features characterize classification problems from different viewpoints independently, so that models constructed with different sets of meta-features are complementary to each other and well suited to ensembling, this paper proposes an ensemble learning-based algorithm recommendation method. To evaluate the proposed recommendation method, extensive experiments with 13 well-known candidate classification algorithms and five different kinds of meta-features are conducted on 1090 benchmark classification problems. The results show the effectiveness of the proposed ensemble learning-based recommendation method.
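
One way to read the proposal is: train a separate recommendation model on each kind of meta-features (each predicting which candidate algorithm performs best on a dataset) and combine their outputs, for example by majority vote. The sketch below illustrates that reading with scikit-learn; the base learner, the voting rule and the data layout are assumptions, not details taken from the paper.

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier

def fit_recommenders(meta_feature_sets, best_algorithm):
    """meta_feature_sets: dict name -> (n_datasets, d_k) NumPy array of one kind of meta-features;
    best_algorithm: length-n_datasets array giving the best-performing algorithm per dataset."""
    return {name: RandomForestClassifier().fit(X, best_algorithm)
            for name, X in meta_feature_sets.items()}

def recommend(models, new_meta_features):
    """new_meta_features: dict name -> 1-D meta-feature vector for a new dataset."""
    votes = [models[name].predict(x.reshape(1, -1))[0]
             for name, x in new_meta_features.items()]
    return Counter(votes).most_common(1)[0][0]   # majority-voted algorithm
```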


Bounded Fuzzy Possibilistic Method of Critical Objects Processing in Machine Learning

arXiv.org Artificial Intelligence

Unsatisfactory accuracy of learning methods is mostly caused by omitting the influence of important parameters such as membership assignments, the type of data objects, and distance or similarity functions. The proposed method, called the Bounded Fuzzy Possibilistic Method (BFPM), addresses different issues that previous clustering or classification methods have not sufficiently considered in their membership assignments. In fuzzy methods, an object's memberships must sum to 1, so any data object may obtain full membership in at most one cluster or class. Possibilistic methods relax this condition, but such a method can be satisfied even if an object obtains membership from just one cluster, which prevents analysis of the objects' movement. BFPM differs from previous fuzzy and possibilistic approaches by removing these restrictions. Furthermore, BFPM provides a flexible search space for analysing objects' movement. Data objects are also considered fundamental keys in learning methods, and knowing the exact type of objects helps provide a suitable environment for learning algorithms. The thesis introduces a new type of object, called critical, and categorizes data objects into two different categories: structural-based and behavioural-based. Critical objects are considered causes of misclassification and misassignment in learning procedures. The thesis also proposes new methodologies to study the behaviour of critical objects with the aim of evaluating objects' movements (mutation) from one cluster or class to another. The thesis further introduces a new type of feature, called dominant, which is considered one of the causes of misclassification and misassignment. Finally, the thesis proposes new sets of similarity functions, called Weighted Feature Distance (WFD) and Prioritized Weighted Feature Distance (PWFD).
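
For contrast, the membership conditions mentioned in the abstract can be written out explicitly. The fuzzy and possibilistic conditions below are the standard ones; the BFPM condition is given in the form used in related BFPM publications and should be read as an assumption about the exact statement, which the thesis defines precisely. For $n$ objects, $c$ clusters and memberships $u_{ij} \in [0,1]$:

```latex
% Fuzzy (probabilistic) partition: each object's memberships sum to one,
% so full membership is possible in at most one cluster.
\sum_{j=1}^{c} u_{ij} = 1 \quad \text{for all } i
% Possibilistic partition: the row-sum constraint is dropped; it suffices
% that every object has positive membership in at least one cluster.
\max_{1 \le j \le c} u_{ij} > 0 \quad \text{for all } i
% BFPM-style relaxation (assumed form): only the average membership is bounded,
% so an object may take full membership in several clusters at once.
0 < \frac{1}{c}\sum_{j=1}^{c} u_{ij} \le 1 \quad \text{for all } i
```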


The MCC-F1 curve: a performance evaluation technique for binary classification

arXiv.org Machine Learning

Many fields use the ROC curve and the PR curve as standard evaluations of binary classification methods. Analysis of ROC and PR, however, often gives misleading and inflated performance evaluations, especially with an imbalanced ground truth. Here, we demonstrate the problems with ROC and PR analysis through simulations, and propose the MCC-F1 curve to address these drawbacks. The MCC-F1 curve combines two informative single-threshold metrics, MCC and the F1 score. The MCC-F1 curve more clearly differentiates good and bad classifiers, even with imbalanced ground truths. We also introduce the MCC-F1 metric, which provides a single value that integrates many aspects of classifier performance across the whole range of classification thresholds. Finally, we provide an R package that plots MCC-F1 curves and calculates related metrics.
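
Conceptually, an MCC-F1 curve is obtained by sweeping the classification threshold, computing MCC and the F1 score at each threshold, and plotting the resulting pairs. The paper ships an R package for this; the sketch below is a hedged Python rendering of the underlying computation using scikit-learn, and the rescaling of MCC from [-1, 1] to [0, 1] is an assumption about the curve's convention.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

def mcc_f1_points(y_true, y_score, thresholds=None):
    """(unit-normalised MCC, F1, threshold) triples across classification thresholds."""
    if thresholds is None:
        thresholds = np.unique(y_score)
    points = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        mcc = matthews_corrcoef(y_true, y_pred)          # in [-1, 1]
        f1 = f1_score(y_true, y_pred, zero_division=0)   # in [0, 1]
        points.append(((mcc + 1) / 2, f1, t))
    return points
```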


Multiclass Disease Predictions Based on Integrated Clinical and Genomics Datasets

arXiv.org Machine Learning

Clinical predictions from clinical data by computational methods are common in bioinformatics. However, clinical predictions that also use information from genomics datasets are not a frequently observed phenomenon in research. Precision medicine research requires information from all available datasets to provide intelligent clinical solutions. In this paper, we have attempted to create a prediction model that uses information from both clinical and genomics datasets. We have demonstrated multiclass disease predictions based on combined clinical and genomics datasets using machine learning methods. We have created an integrated dataset, using a clinical (ClinVar) and a genomics (gene expression) dataset, and trained an instance-based learner on it to predict clinical diseases. We have used an innovative but simple way for multiclass classification, where the number of output classes is as high as 75. We have used Principal Component Analysis for feature selection. The classifier predicted diseases with 73% accuracy on the integrated dataset. The results were consistent and competitive when compared with other classification models. The results show that genomics information can be reliably included in datasets for clinical predictions and can prove valuable in clinical diagnostics and precision medicine.
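
The described pipeline (PCA followed by an instance-based learner) maps naturally onto a scikit-learn pipeline. The sketch below is illustrative only: the component count, the neighbour count and the use of k-nearest neighbours as the instance-based learner are assumptions, and the integration of the ClinVar and gene-expression tables into one feature matrix is not shown.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_model(n_components=50, k=5):
    """PCA for dimensionality reduction, then a k-NN (instance-based) multiclass classifier."""
    return make_pipeline(StandardScaler(),
                         PCA(n_components=n_components),
                         KNeighborsClassifier(n_neighbors=k))

# X: integrated clinical + genomics feature matrix; y: disease labels (up to 75 classes)
# scores = cross_val_score(build_model(), X, y, cv=5)
```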