Accuracy
AI creates efficiencies in sanctions checking @Euromoney
In transaction banking, the focus on technological development has centred on the possibilities of blockchain technology. However, this has overshadowed the arrival of AI into transaction-banking platforms. AI and machine learning are helping to further reduce manual checks and processes. The first target for implementation is sanctions and compliance. As companies become increasingly international, irrespective of size, checking against sanctions has become an essential activity for more than just the MNCs. AI can learn through experience what can pass through the sanctions filter, and what compliance obligations need to be checked.
Huge US facial recognition database flawed: audit
The FBI's facial recognition database has more than 400 million pictures to help its criminal investigations, but lacks adequate safeguards for accuracy and privacy protection, a congressional audit has revealed. The database totals 411.9 million images, and privacy campaigners have slammed the 'unprecedented number of photographs, most of which are of Americans and foreigners who have committed no crimes.' The huge database - which enables investigators to automatically search images for criminal suspects - 'is far greater than had previously been understood' and raises concerns 'about the risk of innocent Americans being inadvertently swept up in criminal investigations,' said Senator Al Franken, who requested the study. The FBI's database includes some 30 million criminal mugshots and 140 million images from visa applications by foreign nationals, the GAO found. It also contains drivers' license pictures from 16 US states and 6.7 million photos from the Defense Department's biometric identification system of individuals detained by US forces abroad, among others.
ACDC: $\alpha$-Carving Decision Chain for Risk Stratification
Park, Yubin, Ho, Joyce, Ghosh, Joydeep
In many healthcare settings, intuitive decision rules for risk stratification can help effective hospital resource allocation. This paper introduces a novel variant of decision tree algorithms that produces a chain of decisions, not a general tree. Our algorithm, $\alpha$-Carving Decision Chain (ACDC), sequentially carves out "pure" subsets of the majority class examples. The resulting chain of decision rules yields a pure subset of the minority class examples. Our approach is particularly effective in exploring large and class-imbalanced health datasets. Moreover, ACDC provides an interactive interpretation in conjunction with visual performance metrics such as Receiver Operating Characteristics curve and Lift chart.
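To make the idea of a decision chain concrete, here is a minimal sketch of how an already-learned chain might be applied to records. It does not implement the rule-learning step or the $\alpha$ parameter from the paper; the feature names (`age`, `prior_admissions`), the thresholds, and the function `apply_chain` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
# A rule is a readable name plus a predicate that "carves out" matching records.
Rule = Tuple[str, Callable[[Record], bool]]


def apply_chain(record: Record, chain: List[Rule]) -> str:
    """Return the name of the first rule that carves the record out.

    Records that pass every rule without matching fall into the final
    'remaining' subset, which in a chain of this kind is where the
    minority class is expected to concentrate.
    """
    for name, carves_out in chain:
        if carves_out(record):
            return name
    return "remaining"


# Hypothetical rules and thresholds, not taken from the paper.
chain: List[Rule] = [
    ("age < 40", lambda r: r["age"] < 40),
    ("prior_admissions == 0", lambda r: r["prior_admissions"] == 0),
]

patients = [
    {"age": 35, "prior_admissions": 2},
    {"age": 67, "prior_admissions": 0},
    {"age": 72, "prior_admissions": 3},
]

labels = [apply_chain(p, chain) for p in patients]
print(labels)
```

Unlike a general tree, each record is tested against the rules in one fixed order, so the path to any outcome can be read off as a short list of conditions.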
No penalty no tears: Least squares in high-dimensional linear models
Wang, Xiangyu, Dunson, David, Leng, Chenlei
Ordinary least squares (OLS) is the default method for fitting linear models, but is not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Numerical exercises comparing our methods with penalization-based approaches in simulations and data analyses illustrate the great potential of the proposed algorithms.
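As a reading aid, the two ingredients named in the abstract can be written out using standard identities. The notation is an assumption made here rather than taken from the abstract: $X$ is the $n \times p$ design matrix, $y$ the response, $r > 0$ a ridge parameter, and $\tau$ a threshold. The paper's exact estimator and three-step procedures may differ.

$$
\hat{\beta}(r) = (X^\top X + r I_p)^{-1} X^\top y = X^\top (X X^\top + r I_n)^{-1} y, \qquad r > 0,
$$

$$
H_\tau(\hat{\beta})_j = \hat{\beta}_j \, \mathbf{1}\{ |\hat{\beta}_j| > \tau \}, \qquad j = 1, \dots, p.
$$

Plain OLS corresponds to $r = 0$ in the first form, which fails when $p > n$ because $X^\top X$ is singular. The second form needs only an $n \times n$ inverse, and hard thresholding $H_\tau$ then keeps the coordinates whose magnitude exceeds $\tau$.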
Invincea First Machine Learning Based Endpoint Security Company to Join Anti-Malware Testing Standards Organization (AMTSO(TM))
FAIRFAX, VA--(Marketwired - June 15, 2016) - Invincea, the leader in advanced endpoint threat protection, announced today that it is the first machine learning based endpoint security company to join the Anti-Malware Testing Standards Organization (AMTSO). Participation in AMTSO furthers Invincea's mission of addressing the global need for improvement in third party testing based on scientific objectivity, quality, and relevance of anti-malware testing methodologies. Hundreds of millions of new pieces of malware are created a year, wreaking havoc on enterprises across industries against the backdrop of obsolete anti-malware approaches. To combat the scourge of malware that evades traditional anti-malware systems, the next-gen endpoint security market has exploded with new companies bringing products to market with fantastic claims. To date, these companies have not been held accountable to their marketing claims by independent scientifically valid testing on the merits of their product technology and approaches.
Data Science (Machine Learning) 101
Data Science, or Machine Learning, is a scary topic. It's hard to know where to get started. It's hard to even find a good definition of what it does and what you have to do. As I've given a few ad hoc presentations on Machine Learning (and though focused on implementing it with Azure, the basics are applicable to other platforms) I thought I'd take my random notes and present them as a primer. You don't need to be a Rocket Scientist to get started, but having a basic understanding of Linear Algebra will be helpful.
Variational Inference for On-line Anomaly Detection in High-Dimensional Time Series
Soelch, Maximilian, Bayer, Justin, Ludersdorfer, Marvin, van der Smagt, Patrick
Approximate variational inference has been shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.
Online Optimization Methods for the Quantification Problem
Kar, Purushottam, Li, Shuai, Narasimhan, Harikrishna, Chawla, Sanjay, Sebastiani, Fabrizio
The estimation of class prevalence, i.e., the fraction of a population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains such as sentiment analysis, epidemiology, etc. For example, in sentiment analysis, the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather estimate the overall distribution of positive and negative sentiments during an event window. A popular way of performing the above task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labeled data. Contemporary literature cites several performance measures used to measure the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures that seek to balance quantification and classification performance. Our algorithms present a significant advancement in the theory of multivariate optimization and we show, by a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures.
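For context, the simplest way to turn a trained classifier into a prevalence estimator is "classify and count", optionally corrected with the classifier's known error rates. The sketch below shows that standard baseline, not the online algorithms proposed in the paper. The function names and the example rates (`tpr=0.8`, `fpr=0.2`) are illustrative assumptions.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """Raw prevalence estimate: fraction of items predicted positive (1)."""
    return sum(predictions) / len(predictions)


def adjusted_count(predictions: Sequence[int], tpr: float, fpr: float) -> float:
    """Correct the raw count using the classifier's error rates.

    tpr and fpr are the true- and false-positive rates, assumed to have
    been measured on held-out labeled data. The result is clipped to [0, 1].
    """
    raw = classify_and_count(predictions)
    if tpr == fpr:
        # The correction is undefined for an uninformative classifier.
        return raw
    estimate = (raw - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Hypothetical classifier outputs for ten texts (1 = positive sentiment).
predictions = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

raw_estimate = classify_and_count(predictions)
corrected_estimate = adjusted_count(predictions, tpr=0.8, fpr=0.2)
print(raw_estimate, corrected_estimate)
```

The correction matters because a classifier tuned for per-item accuracy can still be systematically biased in aggregate: here the raw estimate of 0.4 drops to about 0.33 once the assumed error rates are taken into account.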
Tuning-Free Heterogeneity Pursuit in Massive Networks
Ren, Zhao, Kang, Yongjian, Fan, Yingying, Lv, Jinchi
Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the understanding of important differences among subpopulations of interest. In this paper, we exploit multiple networks with Gaussian graphs to encode the connectivity patterns of a large number of features on the subpopulations. To uncover the heterogeneity of these structures across subpopulations, we suggest a new framework of tuning-free heterogeneity pursuit (THP) via large-scale inference, where the number of networks is allowed to diverge. In particular, two new tests, the chi-based test and the linear functional-based test, are introduced and their asymptotic null distributions are established. Under mild regularity conditions, we establish that both tests are optimal in achieving the testable region boundary and the sample size requirement for the latter test is minimal. Both theoretical guarantees and the tuning-free feature stem from efficient multiple-network estimation by our newly suggested approach of heterogeneous group square-root Lasso (HGSL) for high-dimensional multi-response regression with heterogeneous noises. To solve this convex program, we further introduce a tuning-free algorithm that is scalable and enjoys provable convergence to the global optimum. Both computational and theoretical advantages of our procedure are elucidated through simulation and real data examples.
What happens when your search engine is first to know you have cancer
This week researchers demonstrated that by analyzing a person's Web searches they could in some cases predict an upcoming diagnosis of pancreatic cancer. Unlike traditional medical professionals, they have the advantage of access to a trove of data that Microsoft collects through its search engine, Bing. The Microsoft researchers identified Web users who had recently entered queries indicating they have pancreatic cancer, such as "I was told I have pancreatic cancer, what to expect," and then looked back months earlier to examine patterns in the symptoms that the users searched for. This included phrases such as "dark or tarry stool," "abdominal swelling," "dark urine" and "yellowing skin." From this analysis they found trends in the queries of users who were soon to be diagnosed with pancreatic cancer, identifying 5 to 15 percent of cases with low false-positive rates.