Performance Analysis
Efficient online learning for large-scale peptide identification
Liang, Xijun, Xia, Zhonghang, Wang, Yongxiang, Jian, Ling, Niu, Xinnan, Link, Andrew
Motivation: Post-database searching is a key procedure in peptide identification with tandem mass spectrometry (MS/MS) strategies for refining peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, the challenge remains on large-scale datasets and datasets with an extremely large proportion of false positives (hard datasets). A more efficient learning strategy is required for improving the performance of peptide identification on challenging datasets. Results: In this work, we present an online learning method to address the challenges remaining for existing peptide identification algorithms. We propose a cost-sensitive learning model that uses different loss functions for decoy and target PSMs. The model imposes a larger penalty for wrongly selecting decoy PSMs than for wrongly selecting target PSMs, and thus it can reduce its false discovery rate on hard datasets. We also design an online learning algorithm, OLCS-Ranker, to solve the proposed learning model. Rather than taking all training samples at once, OLCS-Ranker feeds only one training sample into the learning model at each round. As a result, the memory requirement is significantly reduced for large-scale problems. Experimental studies show that OLCS-Ranker outperforms benchmark methods, such as CRanker and Batch-CS-Ranker, in terms of accuracy and stability. Furthermore, OLCS-Ranker is 15--85 times faster than the CRanker method on large datasets. Availability and implementation: OLCS-Ranker software is available at no charge for non-commercial use at https://github.com/Isaac-QiXing/CRanker.
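The two ideas in this abstract, a heavier penalty for decoy errors and one-sample-at-a-time updates, can be illustrated with a small sketch. This is not OLCS-Ranker itself: it assumes a plain linear scorer with a hinge loss, and the names `online_cost_sensitive_train`, `decoy_cost`, `target_cost` and `lr` are hypothetical choices made for illustration only.

```python
import random


def online_cost_sensitive_train(samples, decoy_cost=3.0, target_cost=1.0,
                                lr=0.1, epochs=5, seed=0):
    """Train a linear scorer one sample at a time.

    samples: list of (features, label) with label +1 (target) or -1 (decoy).
    decoy_cost > target_cost is an assumed way to penalise decoy errors more.
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Only the current sample is touched in each round.
            x, y = samples[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:
                # Hinge-loss subgradient step, scaled by the class cost.
                cost = decoy_cost if y < 0 else target_cost
                step = lr * cost * y
                w = [wj + step * xj for wj, xj in zip(w, x)]
                b += step
    return w, b


def psm_score(w, b, x):
    """Score a PSM feature vector; higher means more target-like."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy, linearly separable feature vectors (made up for the example).
toy = [([2.0, 1.5], 1), ([1.5, 2.0], 1), ([2.5, 2.5], 1),
       ([-2.0, -1.5], -1), ([-1.5, -2.0], -1), ([-2.5, -2.5], -1)]
w, b = online_cost_sensitive_train(toy)
```

Because each round reads a single sample, memory use does not grow with the size of the training set, which is the property the abstract highlights.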
Fighting Accounting Fraud Through Forensic Data Analytics
Jofre, Maria, Gerlach, Richard
Accounting fraud is a global concern representing a significant threat to the stability of the financial system due to the resulting decline in market confidence and trust of regulatory authorities. Several tricks can be used to commit accounting fraud, hence the need for non-static regulatory interventions that take into account different fraudulent patterns. Accordingly, this study aims to improve the detection of accounting fraud via the implementation of several machine learning methods to better differentiate between fraud and non-fraud companies, and to further assist the task of examination within the riskier firms by evaluating relevant financial indicators. Out-of-sample results suggest there is great potential in detecting falsified financial statements through statistical modelling and analysis of publicly available accounting information. The proposed methodology can be of assistance to public auditors and regulatory agencies as it facilitates auditing processes, and supports more targeted and effective examinations of accounting reports.
40 Questions to test a data scientist on Machine Learning [Solution: SkillPower – Machine Learning, DataFest 2017]
Machine Learning is one of the most sought-after skills these days. If you are a data scientist, then you need to be good at Machine Learning – no two ways about it. As part of DataFest 2017, we organized various skill tests so that data scientists can assess themselves on these critical skills. These tests included Machine Learning, Deep Learning, Time Series problems and Probability. This article will lay out the solutions to the machine learning skill test. If you missed out on any of the above skill tests, you can still check out the questions and answers through the articles linked above. More than 1350 people registered for the Machine Learning skill test.
Predicting Graph Categories from Structural Properties
Canning, James P., Ingram, Emma E., Nowak-Wolff, Sammantha, Ortiz, Adriana M., Ahmed, Nesreen K., Rossi, Ryan A., Schmitt, Karl R. B., Soundarajan, Sucheta
Complex networks are often categorized according to the underlying phenomena that they represent such as molecular interactions, re-tweets, and brain activity. In this work, we investigate the problem of predicting the category (domain) of arbitrary networks. This includes complex networks from different domains as well as synthetically generated graphs from five different network models. A classification accuracy of $96.6\%$ is achieved using a random forest classifier with both real and synthetic networks. This work makes two important findings. First, our results indicate that complex networks from various domains have distinct structural properties that allow us to predict with high accuracy the category of a new previously unseen network. Second, synthetic graphs are trivial to classify as the classification model can predict with near-certainty the network model used to generate it. Overall, the results demonstrate that networks drawn from different domains (and network models) are trivial to distinguish using only a handful of simple structural properties.
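As an illustration of what "a handful of simple structural properties" could look like, the sketch below computes three common ones from an edge list. The choice of density, average degree and average local clustering is an assumption; the abstract does not list the features actually used, and `structural_features` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations


def structural_features(edges):
    """Return simple structural properties of an undirected graph.

    edges: iterable of (u, v) pairs; self-loops and duplicates are ignored.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0
    # Local clustering: fraction of a node's neighbour pairs that are linked.
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, c in combinations(nbrs, 2) if c in adj[a])
        total += 2 * links / (k * (k - 1))
    avg_clustering = total / n if n else 0.0
    return {"nodes": n, "edges": m, "density": density,
            "avg_degree": avg_degree, "avg_clustering": avg_clustering}


triangle = structural_features([(0, 1), (1, 2), (0, 2)])
path = structural_features([(0, 1), (1, 2)])
```

Feature dictionaries like these, one per network, would then be passed to a classifier such as a random forest, which needs a separate library and is not shown here.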
Companies Need To Start Reporting What AI Gets Wrong, Not Just What It Does Right
In just a few years, deep learning-powered AI and other forms of machine learning have exploded from niche tools into the underlying basis of nearly every major modern online platform. Yet, even as these algorithms increasingly wield near-absolute power over what we see and say online, we have precious little visibility into how they function. If AI tools can robustly prevent hate speech and terrorist recruiting, flag false news and delete financial scams, they will be a welcome addition to our online lives. On the other hand, without any visibility into how often they get things wrong, we have little reason to trust their successes. AI and machine learning have become ubiquitous on the modern web, powering everything from security scanning to content moderation.
Bayesian Regularization for Graphical Models with Unequal Shrinkage
Gan, Lingrui, Narisetty, Naveen N., Liang, Feng
We consider a Bayesian framework for estimating a high-dimensional sparse precision matrix, in which adaptive shrinkage and sparsity are induced by a mixture of Laplace priors. Besides discussing our formulation from the Bayesian standpoint, we investigate the MAP (maximum a posteriori) estimator from a penalized likelihood perspective that gives rise to a new non-convex penalty approximating the $\ell_0$ penalty. Optimal error rates for estimation consistency in terms of various matrix norms along with selection consistency for sparse structure recovery are shown for the unique MAP estimator under mild conditions. For fast and efficient computation, an EM algorithm is proposed to compute the MAP estimator of the precision matrix and (approximate) posterior probabilities on the edges of the underlying sparse structure. Through extensive simulation studies and a real application to call center data, we demonstrate the strong performance of our method compared with existing alternatives.
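For orientation, a generic two-component mixture of Laplace priors on an off-diagonal entry $\omega_{ij}$ of the precision matrix can be written as below. The symbols $\eta$, $v_0$ and $v_1$ are assumed notation, and the exact parameterization used by the authors may differ.

$$\pi(\omega_{ij}) = \frac{\eta}{2 v_1} \exp\!\left(-\frac{|\omega_{ij}|}{v_1}\right) + \frac{1-\eta}{2 v_0} \exp\!\left(-\frac{|\omega_{ij}|}{v_0}\right), \qquad v_1 > v_0 > 0, \quad 0 < \eta < 1.$$

Under this form, the penalty implied by the MAP view is $\mathrm{pen}(\omega_{ij}) = -\log \pi(\omega_{ij})$. The narrow component (scale $v_0$) shrinks small entries strongly toward zero, while the wide component (scale $v_1$) penalizes large entries only lightly, which is one way to obtain the unequal shrinkage named in the title.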
Facial recognition tech used by UK police is making a ton of mistakes
At the end of each summer for the last 14 years, the small Welsh town of Porthcawl has been invaded. Every year its 16,000 population is swamped by up to 35,000 Elvis fans. Many people attending the yearly festival look the same: they slick back their hair, throw on oversized sunglasses and don white flares. At 2017's Elvis festival, impersonators were faced with something different. Police were trialling automated facial recognition technology to track down criminals.
Learning to Represent Programs with Graphs
Allamanis, Miltiadis, Brockschmidt, Marc, Khademi, Mahmoud
Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code's known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures. In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.
What Mark Zuckerberg Gets Wrong--and Right--About Hate Speech
When he testified before Congress last month, Facebook CEO Mark Zuckerberg discussed the problem of using artificial intelligence to identify online hate speech. He said he was optimistic that in five to 10 years, "We will have AI tools that can get into some of the linguistic nuances of different types of content to be more accurate in flagging content for our systems, but today we're not just there on that." Brittan Heller (@brittanheller) is director of the Anti-Defamation League's Center for Technology and Society and works with social media companies to reduce cyberhate and online harassment. As an expert on hate speech who recently developed an AI-based system to study online hate, I can confidently say that Zuckerberg is both right and wrong. He is right that AI is not a panacea, since hate speech relies on nuances that algorithms cannot fully detect. At the same time, just because AI does not solve the problem entirely doesn't mean it's useless.
[P] The unreasonable usefulness of deep learning in building and cleaning medical image datasets • r/MachineLearning
One thing I find weird is that we have lots of discussion of deep learning in complex detection and recognition tasks, but very few people talk about how useful deep learning can be for simple but time consuming image data processing tasks, particularly in medical research. In this post I spend a bit of time cleaning up the CXR14 dataset, and in 4 hours find 430 images with various problems that shouldn't be in the dataset (a csv identifying these images is included in the post). While the prevalence of these problems is super low ( 50/100,000), since the visual challenge is very easy the models can achieve absurdly low false positive rates. I even get an AUROC of 1.0 in a 2000 image validation set on one task:) In doing so, cleaning this dataset to remove 3 different problems didn't take me weeks to pore through each image, but under a day. Certainly nothing in the post is technically groundbreaking, but it is hopefully a prompt to consider deep learning when you are doing time consuming processing.