Performance Analysis
Exploring the Semantic Content of Unsupervised Graph Embeddings: An Empirical Study
Bonner, Stephen, Kureshi, Ibad, Brennan, John, Theodoropoulos, Georgios, McGough, Andrew Stephen, Obara, Boguslaw
Graph embeddings have become a key and widely used technique within the field of graph mining, proving to be successful across a broad range of domains including social, citation, transportation and biological. Graph embedding techniques aim to automatically create a low-dimensional representation of a given graph, which captures key structural elements in the resulting embedding space. However, to date, there has been little work exploring exactly which topological structures are being learned in the embedding process. In this paper, we investigate whether graph embeddings are approximating something analogous to traditional vertex-level graph features. If such a relationship can be found, it could be used to provide theoretical insight into how graph embedding approaches function. We perform this investigation by predicting known topological features, using supervised and unsupervised methods, directly from the embedding space. If a mapping between the embeddings and topological features can be found, then we argue that the structural information encapsulated by the features is represented in the embedding space. To explore this, we present an extensive experimental evaluation of five state-of-the-art unsupervised graph embedding techniques, across a range of empirical graph datasets, measuring a selection of topological features. We demonstrate that several topological features are indeed being approximated by the embedding space, allowing key insight into how graph embeddings create good representations.
Outcome-Oriented Predictive Process Monitoring: Review and Benchmark
Teinemaa, Irene, Dumas, Marlon, La Rosa, Marcello, Maggi, Fabrizio Maria
Traditional process monitoring techniques provide dashboards and reports showing the recent performance of a business process in terms of key performance indicators such as mean execution time, resource utilization or error rate with respect to a given notion of error. Predictive (business) process monitoring techniques go beyond traditional ones by making predictions about the future state of the executions of a business process (herein called cases). For example, a predictive monitoring technique may seek to predict the remaining execution time of each ongoing case of a process [29], the next activity that will be executed in each case [11], or the final outcome of a case, with respect to a possible set of business outcomes [23-25]. For instance, in an order-to-cash process (a process going from the receipt of a purchase order to the receipt of payment of the corresponding invoice), the possible outcomes of a case may be that the purchase order is closed satisfactorily (i.e., the customer accepted the products and paid) or unsatisfactorily (e.g., the order was canceled or withdrawn). Another set of possible outcomes is that the products were delivered on time (with respect to a maximum acceptable delivery time), or delivered late. Recent years have seen the emergence of a rich field of proposed methods for predictive process monitoring in general, and predictive monitoring of (categorical) case outcomes in particular - herein called outcome-oriented predictive process monitoring. Unfortunately, there is no unified approach to evaluate these methods. Indeed, different authors have used different datasets, experimental settings, evaluation measures and baselines.
Microsoft weeds out fake marketing leads with Naïve Bayes and Machine Learning Server
To connect with potential customers, our marketers and sellers at Microsoft depend on good-quality leads. But sometimes people fill out online forms with fake names, gibberish, or even profanity. We distinguish fake company names from legitimate names in our data using the programming language R, the Naive Bayes classifier algorithm, Microsoft Machine Learning Server, and a data quality service that we built. This solution helps us weed out fake names and prioritize good leads for our sales and marketing teams.
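As a rough illustration of the idea described above, the following is a minimal sketch of a Naive Bayes classifier over character bigrams, written in Python rather than the R and Machine Learning Server stack the article mentions. It is not Microsoft's implementation: the class name `NaiveBayesText`, the helper `char_ngrams`, the bigram features, and the tiny training set are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split a name into overlapping character n-grams, with boundary markers."""
    padded = f"^{text.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesText:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        # Vocabulary = every n-gram seen in any class during training.
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_label / total)  # log prior
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)  # smoothed likelihood
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy data, invented for illustration only.
    names = ["Contoso Ltd", "Fabrikam Inc", "Northwind Traders",
             "asdfgh", "qwerty", "zzzzzz"]
    labels = ["legit", "legit", "legit", "fake", "fake", "fake"]
    model = NaiveBayesText().fit(names, labels)
    print(model.predict("zzzzzz"))
```

A real system would need far more training data and a held-out evaluation set; the sketch only shows how class priors and smoothed per-feature likelihoods combine into a score.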
MultiFIT: Multivariate Multiscale Framework for Independence Tests
We present a framework for testing independence between two random vectors that is scalable to massive data. Taking a "divide-and-conquer" approach, we break down the nonparametric multivariate test of independence into simple univariate independence tests on a collection of $2\times 2$ contingency tables, constructed by sequentially discretizing the original sample space at a cascade of scales from coarse to fine. This transforms a complex nonparametric testing problem---that traditionally requires quadratic computational complexity with respect to the sample size---into a multiple testing problem that can be addressed with a computational complexity that scales almost linearly with the sample size. We further consider the scenario when the dimensionality of the two random vectors also grows large, in which case the curse of dimensionality arises in the proposed framework through an explosion in the number of univariate tests to be completed. To overcome this difficulty, we propose a data-adaptive version of our method that completes a fraction of the univariate tests, judged to be more likely to contain evidence for dependency based on exploiting the spatial characteristics of the dependency structure in the data. We provide an inference recipe based on multiple testing adjustment that guarantees the inferential validity in terms of properly controlling the family-wise error rate. We demonstrate the tremendous computational advantage of the algorithm in comparison to existing approaches while achieving desirable statistical power through an extensive simulation study. In addition, we illustrate how our method can be used for learning the nature of the underlying dependency in addition to hypothesis testing. We demonstrate the use of our method through analyzing a data set from flow cytometry.
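The basic building block mentioned in the abstract, a univariate independence test on a $2\times 2$ contingency table followed by a multiple testing adjustment, can be sketched as below. This is not the MultiFIT algorithm itself: the chi-square test, the Bonferroni adjustment, and the function names are assumptions chosen for illustration, and the coarse-to-fine cascade and the data-adaptive selection of tests are omitted.

```python
import math


def two_by_two_table(xs, ys, x_cut, y_cut):
    """Count paired observations falling on each side of the two cut points."""
    table = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        table[int(x > x_cut)][int(y > y_cut)] += 1
    return table


def chi_square_p_value(table):
    """Pearson chi-square test of independence for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 1.0  # degenerate table: no evidence against independence
    stat = n * (a * d - b * c) ** 2 / math.prod(margins)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def bonferroni(p_values):
    """Adjust p-values so the family-wise error rate is controlled."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    xs = list(range(20))
    ys = list(range(20))  # perfectly dependent toy sample
    table = two_by_two_table(xs, ys, x_cut=9.5, y_cut=9.5)
    print(table, chi_square_p_value(table))
```

In a multiscale scheme, each table would come from a different discretization of the sample space, and the adjusted p-values would be examined jointly.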
Evaluating and Characterizing Incremental Learning from Non-Stationary Data
Cervantes, Alejandro, Gagné, Christian, Isasi, Pedro, Parizeau, Marc
Incremental learning from non-stationary data poses special challenges to the field of machine learning. Although new algorithms have been developed for this, assessment of results and comparison of behaviors are still open problems, mainly because evaluation metrics, adapted from more traditional tasks, can be ineffective in this context. Overall, there is a lack of common testing practices. This paper thus presents a testbed for incremental non-stationary learning algorithms, based on specially designed synthetic datasets. Also, test results are reported for some well-known algorithms to show that the proposed methodology is effective at characterizing their strengths and weaknesses. It is expected that this methodology will provide a common basis for evaluating future contributions in the field.
Cross-validation Tutorial: What, how and which?
"Statistics [from cross-validation] are like bikinis. Training set Test set 2 4. P. Raamana Goals for Today โข What is cross-validation? Training set Test set โต 2 5. P. Raamana Goals for Today โข What is cross-validation? Training set Test set โต 2 6. Training set Test set โต negative bias unbiased positive bias 2 7. P. Raamana What is generalizability? Training set Test set 5 18. Training set Test set bigger training set better learning 5 19. Training set Test set bigger training set better learning better testing bigger test set 5 20. Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint.
Why Won't Facebook Talk About How Often Its Algorithms Are Wrong?
Two weeks ago Facebook released yet another glossy marketing infographic site and video touting how its state of the art technology, top engineers and teams of experts have made massive strides in conquering yet another scourge of the online world through the power of advanced algorithms. This past week its EMEA counterterrorism lead announced that its algorithms were now deleting 99% of all ISIS and al-Qaida terrorism content across the site. As with all of Facebook's announcements to date, neither of these proclamations made any mention of how often the algorithms that increasingly control its platform are wrong and whether they are actually right more often than they are wrong. After initially promising to provide a response, the company once again declined to comment on the false positive rates of its algorithms or why despite repeated requests it continues to refuse to release those numbers. Why is the company so afraid to talk about whether its algorithms are actually accurate?
Binary Classification in Unstructured Space With Hypergraph Case-Based Reasoning
Binary classification is one of the most common problems in machine learning. It consists of predicting whether a given element belongs to a particular class. In this paper, a new algorithm for binary classification is proposed using a hypergraph representation. Each element to be classified is partitioned according to its interactions with the training set. For each class, the total support is calculated as a convex combination of the evidence strength of the elements of the partition. The evidence measure is pre-computed using the hypergraph induced by the training set and iteratively adjusted through a training phase. It does not require structured information, each case being represented by a set of agnostic information atoms. Empirical validation demonstrates its high potential on a wide range of well-known datasets, and the results are compared to the state of the art. The time complexity is given and empirically validated. Its capacity to provide good performance without hyperparameter tuning, compared to standard classification methods, is studied. Finally, the limitation of the model space is discussed and some potential solutions are proposed.
WWE Money In The Bank 2018: Predictions, Match Card, Preview For Wrestling PPV
There's a lot on the line at WWE Money in the Bank 2018, which has clearly become the most important pay-per-view that isn't among the "Big Four." Not only will five titles be defended, but world championship opportunities are also up for grabs Sunday night. Money in the Bank features 10 matches, including Ronda Rousey's first singles match, two ladder matches and a last man standing match. Styles and Nakamura have been feuding for the better part of three months. WWE has had plenty of chances to put the WWE Championship on the Japanese superstar, yet Styles has continued to hold the belt, even with Nakamura's heel turn.
Machine learning "red dot": open-source, cloud, deep convolutional neural networks in chest radiograph binary normality classification. - PubMed - NCBI
To develop a machine learning-based model for the binary classification of chest radiography abnormalities, to serve as a retrospective tool in guiding clinician reporting prioritisation. The open-source machine learning library, Tensorflow, was used to retrain a final layer of the deep convolutional neural network, Inception, to perform binary normality classification on two, anonymised, public image datasets. Re-training was performed on 47,644 images using commodity hardware, with validation testing on 5,505 previously unseen radiographs. Confusion matrix analysis was performed to derive diagnostic utility metrics. This study demonstrates the application of a machine learning-based approach to classify chest radiographs as normal or abnormal.
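Deriving diagnostic utility metrics from a confusion matrix, as the abstract describes, might look like the sketch below. The function names and the toy labels are assumptions for illustration; the study's own figures are not reproduced here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = abnormal)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def diagnostic_metrics(tp, fp, tn, fn):
    """Derive common diagnostic metrics from confusion matrix counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),          # true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "false_positive_rate": ratio(fp, fp + tn),
        "ppv": ratio(tp, tp + fp),                  # positive predictive value
        "npv": ratio(tn, tn + fn),                  # negative predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical labels, invented for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
    print(diagnostic_metrics(*confusion_counts(y_true, y_pred)))
```

Reporting sensitivity and specificity together, rather than accuracy alone, makes the false positive and false negative behaviour of a classifier visible.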