Performance Analysis
A meta-algorithm for classification using random recursive tree ensembles: A high energy physics application
The aim of this work is to propose a meta-algorithm for automatic classification in the presence of discrete binary classes. Classifier learning in the presence of overlapping class distributions is a challenging problem in machine learning. Overlapping classes are characterized by ambiguous regions of the feature space with a high density of points belonging to both classes. This often occurs in real-world datasets; one example is numeric data describing properties of particle decays recorded at high-energy accelerators such as the Large Hadron Collider (LHC). A significant body of research targeting the class overlap problem uses ensemble classifiers to boost the performance of algorithms, either by applying them iteratively in multiple stages or by training multiple copies of the same model on different subsets of the input training data. The former is called boosting and the latter is called bagging. The algorithm proposed in this thesis targets a challenging classification problem in high energy physics: improving the statistical significance of the Higgs discovery. The underlying dataset used to train the algorithm is experimental data built from the official ATLAS full-detector simulation, with Higgs events (signal) mixed with different background events (background) that closely mimic the statistical properties of the signal, generating class overlap. The proposed algorithm is a variant of the classical boosted decision tree, which is known to be one of the most successful analysis techniques in experimental physics. The algorithm uses a unified framework that combines two meta-learning techniques, bagging and boosting. The results show that this combination only works in the presence of a randomization trick in the base learners.
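The bagging half of this idea can be sketched in a few lines of Python. This is an illustration only, not the thesis's algorithm: the one-dimensional threshold "stumps", the random draw of candidate thresholds (an assumed stand-in for the unspecified randomization trick), the toy data, and the names `fit_random_stump`, `train_bagged_stumps`, and `bagged_predict` are all hypothetical.

```python
import random


def fit_random_stump(sample, rng, n_candidates=5):
    """Return the best threshold among a few randomly drawn candidates.

    A stump predicts signal (1) when x > threshold, else background (0).
    Drawing the candidates at random is an assumption made for this sketch.
    """
    xs = [x for x, _ in sample]
    candidates = rng.sample(xs, k=min(n_candidates, len(xs)))

    def errors(threshold):
        # Count points whose predicted class differs from their label.
        return sum((x > threshold) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def train_bagged_stumps(data, n_models=15, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        thresholds.append(fit_random_stump(bootstrap, rng))
    return thresholds


def bagged_predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = sum(x > t for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0


# Toy 1-D data with overlapping classes: (feature value, label).
data = [(0.0, 0), (0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (1.8, 1), (2.2, 1), (3.0, 1), (3.5, 1), (4.0, 1)]

thresholds = train_bagged_stumps(data)
print(bagged_predict(thresholds, 0.2), bagged_predict(thresholds, 3.8))
```

Boosting would differ in that each round reweights the points the previous learners misclassified instead of resampling them uniformly.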
An Approach for Time-aware Domain-based Social Influence Prediction
Abu-Salih, Bilal, Chan, Kit Yan, Al-Kadi, Omar, Al-Tawil, Marwan, Wongthongtham, Pornpit, Issa, Tomayess, Saadeh, Heba, Al-Hassan, Malak, Bremie, Bushra, Albahlal, Abdulaziz
Online Social Networks (OSNs) have established virtual platforms that enable people to express their opinions, interests, and thoughts in a variety of contexts and domains, allowing legitimate users as well as spammers and other untrustworthy users to publish and spread their content. Hence, the concept of social trust has attracted the attention of information processors/data scientists and information consumers/business firms. One of the main reasons for acquiring the value of Social Big Data (SBD) is to provide frameworks and methodologies with which the credibility of OSN users can be evaluated. These approaches should be scalable to accommodate large-scale social data. Hence, there is a need for a thorough understanding of social trust to improve and expand the analysis process and to infer the credibility of SBD. Given the open setting of OSNs and the few restrictions they impose, the medium allows legitimate and genuine users as well as spammers and other users of low trustworthiness to publish and spread their content. Hence, this paper presents an approach that incorporates semantic analysis and machine learning modules to measure and predict users' trustworthiness in numerous domains over different time periods. The evaluation of the conducted experiment validates the applicability of the incorporated machine learning techniques to predict highly trustworthy domain-based users.
BioSNet: A Fast-Learning and High-Robustness Unsupervised Biomimetic Spiking Neural Network
Meng, Mingyuan, Yang, Xingyu, Xiao, Shanlin, Yu, Zhiyi
The Spiking Neural Network (SNN), as a brain-inspired machine learning algorithm, is closer to the computing mechanism of the human brain and more suitable for revealing the essence of intelligence than Artificial Neural Networks (ANNs), and has attracted increasing attention in recent years. In addition, the information processed by an SNN takes the form of discrete spikes, which gives SNNs low power consumption. In this paper, we propose an efficient and robust unsupervised SNN named BioSNet, with high biological plausibility, to handle image classification tasks. In BioSNet, we propose a new biomimetic spiking neuron model named MRON, inspired by 'recognition memory' in the human brain; design an efficient and robust network architecture that also corresponds to biological characteristics of the human brain; and extend the traditional voting mechanism to the Vote-for-All (VFA) decoding layer so as to reduce information loss during decoding. Simulation results show that BioSNet not only achieves state-of-the-art unsupervised classification accuracy on the MNIST/EMNIST data sets, but also exhibits superior learning efficiency and high robustness. Specifically, BioSNet trained with only dozens of samples per class can achieve a classification accuracy of over 80%, and randomly deleting even 95% of the synapses or neurons in BioSNet leads to only slight performance degradation.
Inference for Network Structure and Dynamics from Time Series Data via Graph Neural Network
Chen, Mengyuan, Zhang, Jiang, Zhang, Zhang, Du, Lun, Hu, Qiao, Wang, Shuo, Zhu, Jiaqi
Network structures in various contexts play important roles in social, technological, and biological systems. However, the observable network structures in real cases are often incomplete or unavailable due to measurement errors or privacy protection issues. Therefore, inferring the complete network structure is useful for understanding complex systems. Existing studies have not fully solved the problem of inferring network structure with partial or no information about connections or nodes. In this paper, we tackle the problem by utilizing time series data generated by network dynamics. We regard the network inference problem based on dynamical time series data as a problem of minimizing errors in predicting future states, and propose a novel data-driven deep learning model called Gumbel Graph Network (GGN) to solve two kinds of network inference problems: network reconstruction and network completion. For the network reconstruction problem, the GGN framework includes two modules: the dynamics learner and the network generator. For the network completion problem, GGN adds a new module called the states learner to infer missing parts of the network. We carried out experiments on discrete and continuous time series data. The experiments show that our method can reconstruct up to 100% of the network structure on the network reconstruction task, and that the model can infer the unknown parts of the structure with up to 90% accuracy when some nodes are missing. The accuracy decays as the fraction of missing nodes increases. Our framework may find wide application in areas where the network structure is hard to obtain and time series data is rich.
Cyber Attack Detection thanks to Machine Learning Algorithms
Delplace, Antoine, Hermoso, Sheryl, Anandita, Kristofer
Cybersecurity attacks have been growing in both frequency and sophistication over the years. This increasing sophistication and complexity call for continuous advancement and innovation in defensive strategies. Traditional methods of intrusion detection and deep packet inspection, while still widely used and recommended, are no longer sufficient to meet the demands of growing security threats. As computing power increases and costs drop, Machine Learning is seen as an alternative method or an additional mechanism to defend against malware, botnets, and other attacks. This paper explores Machine Learning as a viable solution by examining its ability to classify malicious traffic in a network. First, a thorough data analysis is performed, resulting in 22 features extracted from the initial NetFlow datasets. All these features are then compared with one another through a feature selection process. Then, our approach evaluates five different machine learning algorithms on a NetFlow dataset containing common botnets. The Random Forest Classifier succeeds in detecting more than 95% of the botnets in 8 out of 13 scenarios and more than 55% in the most difficult datasets. Finally, insight is given on how to improve and generalize the results, especially through a bootstrapping technique.
Channels' Confirmation and Predictions' Confirmation: from the Medical Test to the Raven Paradox
After long arguments between positivism and falsificationism, the verification of universal hypotheses was replaced with the confirmation of uncertain major premises. Unfortunately, Hempel discovered the Raven Paradox (RP). Then, Carnap used the logical probability increment as the confirmation measure. Since then, many confirmation measures have been proposed. Among them, measure F, proposed by Kemeny and Oppenheim, possesses the symmetries and asymmetries proposed by Eells and Fitelson, the monotonicity proposed by Greco et al., and the normalizing property suggested by many researchers. Based on semantic information theory, a measure b* similar to F is derived from the medical test. Like the likelihood ratio, b* and F can only indicate the quality of channels or testing means, not the quality of probability predictions. Moreover, it is still not easy to use b*, F, or another measure to clarify the RP. For this reason, a measure c* similar to the correct rate is derived. The measure c* has the simple form (a-c)/max(a, c); it supports the Nicod Criterion and undermines the Equivalence Condition, and hence can be used to eliminate the RP. Some examples are provided to show why it is difficult to use one of the popular confirmation measures to eliminate the RP. Measures F, b*, and c* indicate that having fewer counterexamples is more essential than having more positive examples, and hence are compatible with Popper's falsification thought.
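The closed form of c* is easy to try out numerically. The sketch below assumes that a counts positive examples and c counts counterexamples, which the abstract suggests but does not state; the function name `c_star` and the sample counts are hypothetical.

```python
def c_star(a, c):
    """Evaluate (a - c) / max(a, c) for non-negative counts a and c.

    Assumption for this sketch: a = number of positive examples,
    c = number of counterexamples.
    """
    if a < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    if a == 0 and c == 0:
        raise ValueError("c* is undefined when both counts are zero")
    return (a - c) / max(a, c)


# A few positive examples with no counterexamples...
few_clean = c_star(10, 0)
# ...versus many positive examples with some counterexamples.
many_mixed = c_star(1000, 100)
print(few_clean, many_mixed)
```

Under that reading, ten positive examples with no counterexample score higher than a thousand positive examples with a hundred counterexamples, which matches the abstract's remark that fewer counterexamples matter more than additional positive examples.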
Deep Learning Illustrated: Building Natural Language Processing Models
As shown in Example 11.20, we compile our dense sentiment classifier with a line of code that should already be familiar from recent chapters, except that--because we have a single output neuron within a binary classifier--we use binary_crossentropy cost in place of the categorical_crossentropy cost we used for our multiclass MNIST classifiers.
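The reason the cost function changes with a single output neuron can be checked by hand. The sketch below is plain Python, not the book's Example 11.20 and not Keras's implementation; the helper names and sample probabilities are hypothetical. It shows that binary cross-entropy on one sigmoid output p equals categorical cross-entropy on the implied two-class distribution (1 - p, p).

```python
import math


def binary_crossentropy(y_true, p):
    """Loss for one example: y_true is 0 or 1, p is the predicted P(class 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(one_hot, probs):
    """Loss for one example with a one-hot label and a probability per class."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs) if t)


p = 0.8  # single sigmoid output: predicted probability of the positive class

# Positive example: label 1, or one-hot [0, 1] over (negative, positive).
print(binary_crossentropy(1, p), categorical_crossentropy([0, 1], [1 - p, p]))

# Negative example: label 0, or one-hot [1, 0].
print(binary_crossentropy(0, p), categorical_crossentropy([1, 0], [1 - p, p]))
```

Each pair of printed values agrees up to floating-point rounding, so one output neuron with the binary cost carries the same information as two softmax outputs with the categorical cost.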
Coronary Artery Disease Diagnosis; Ranking the Significant Features Using Random Trees Model
Joloudari, Javad Hassannataj, Joloudari, Edris Hassannataj, Saadatfar, Hamid, GhasemiGol, Mohammad, Razavi, Seyyed Mohammad, Mosavi, Amir, Nabipour, Narjes, Shamshirband, Shahaboddin, Nadai, Laszlo
Since data collection and analysis are difficult, time-consuming, and costly, we are always looking for ways to make optimal use of data to reach correct decisions in diagnosing and testing for diseases in healthcare organizations [3]. In addition, common methods for testing and diagnosing diseases, such as angiography [5,6], are costly and have adverse effects for patients, so healthcare researchers are trying to use methods that avoid the high cost and adverse effects of previous methods. This can be done with computer-aided disease diagnosis methods, that is, machine learning. The data mining process, which draws on machine learning and database management knowledge [1], has become a robust tool for the analysis and management of health industry data, ultimately leading to knowledge extraction. It should be noted that, with the progress of technology in healthcare, especially healthcare industry 4.0, human life has progressed and become more comfortable [7]. In this new generation, with the development of new medical devices, equipment, and tools, new knowledge can be gained in the field of disease diagnosis.
Smart Data based Ensemble for Imbalanced Big Data Classification
García-Gil, Diego, Holmberg, Johan, García, Salvador, Xiong, Ning, Herrera, Francisco
Big Data scenarios pose a new challenge to traditional data mining algorithms, since they are not prepared to work with such amounts of data. Smart Data refers to data of sufficient quality to improve the outcome of a data mining algorithm. The inability of existing data mining algorithms to handle Big Datasets prevents the transition from Big to Smart Data. The automation in data acquisition that characterizes Big Data also brings some problems, such as differences in data size per class. This leads classifiers to lean towards the most represented classes. This problem is known as imbalanced data distribution, where one class is underrepresented in the dataset. Ensembles of classifiers are machine learning methods that improve the performance of a single base classifier by combining several of them. Ensembles are not exempt from the imbalanced classification problem. To deal with this issue, ensemble methods have to be designed specifically. In this paper, a data preprocessing ensemble for imbalanced Big Data classification is presented, with a focus on two-class problems. Experiments carried out on 21 Big Datasets show that our ensemble classifier outperforms classic machine learning models with an added data balancing method, such as Random Forests.
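For readers unfamiliar with data balancing, the simplest variant is random undersampling of the larger classes. The sketch below shows that generic baseline only; it is not the preprocessing ensemble proposed in the paper, and the function name `random_undersample`, the `label_of` parameter, and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(rows, label_of, seed=0):
    """Downsample every class to the size of the smallest class.

    rows:     list of examples (assumed non-empty)
    label_of: function mapping an example to its class label
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Toy two-class dataset: 95 majority rows and 5 minority rows.
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(5)]
balanced = random_undersample(rows, label_of=lambda row: row[1])
print(len(balanced))  # 10
```

Undersampling discards most of the majority class, which is one reason more elaborate preprocessing is attractive at Big Data scale.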
On Model Evaluation under Non-constant Class Imbalance
Brabec, Jan, Komárek, Tomáš, Franc, Vojtěch, Machlica, Lukáš
Many real-world classification problems are significantly class-imbalanced, to the detriment of the class of interest. The standard set of proper evaluation metrics is well known, but the usual assumption is that the test dataset imbalance equals the real-world imbalance. In practice, this assumption is often broken for various reasons. The reported results are then often too optimistic and may lead to wrong conclusions about the industrial impact and suitability of proposed techniques. We introduce methods focusing on evaluation under non-constant class imbalance. We show that not only the absolute values of commonly used metrics, but even the ranking of classifiers under a given evaluation metric, is affected by a change in the imbalance rate. Finally, we demonstrate that subsampling to obtain a test dataset with class imbalance equal to the one observed in the wild is not necessary, and can even lead to significant errors in the estimate of a classifier's performance.
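The dependence of a metric on the imbalance rate can be illustrated with precision and F1. The sketch below is not the paper's method: it assumes each classifier's true positive rate (TPR) and false positive rate (FPR) stay fixed while the fraction of positives changes, and the two classifiers and their rates are invented for the example.

```python
def precision_at(tpr, fpr, prevalence):
    """Expected precision when a fraction `prevalence` of examples is positive.

    Assumes tpr and fpr do not change with prevalence, and that the
    classifier flags at least some examples (tp + fp > 0).
    """
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)


def f1_at(tpr, fpr, prevalence):
    """Expected F1 score; recall equals tpr."""
    p = precision_at(tpr, fpr, prevalence)
    return 2 * p * tpr / (p + tpr)


# Hypothetical classifiers as (TPR, FPR) pairs.
classifier_a = (0.90, 0.10)   # high recall, more false alarms
classifier_b = (0.60, 0.01)   # lower recall, few false alarms

for prevalence in (0.5, 0.01):
    f1_a = f1_at(*classifier_a, prevalence)
    f1_b = f1_at(*classifier_b, prevalence)
    print(prevalence, round(f1_a, 3), round(f1_b, 3))
```

On a balanced test set classifier A has the higher F1, but at 1% positives classifier B comes out ahead, so a ranking measured at one imbalance rate need not hold at another.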