Performance Analysis
Giuliano Liguori on LinkedIn: #BigData #Analytics #DataScience
The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable. K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and operates on it at classification time. The Naive Bayes classification algorithm is a probabilistic classifier.
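The "lazy learner" behaviour of K-NN can be illustrated with a minimal sketch. The function name `knn_predict`, the toy data, and the choice of Euclidean distance with majority voting are assumptions made for illustration; the original post does not specify them.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples.

    `train` is a list of (feature_vector, label) pairs. Nothing is fitted in
    advance: the stored data is only consulted at prediction time.
    """
    # Rank the stored examples by Euclidean distance to the query point.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k closest examples.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: two well-separated groups.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

All the work happens inside `knn_predict`, so there is no separate training step. Prediction cost therefore grows with the size of the stored dataset.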
Building Transparency Into AI Projects - AI Summary
That means communicating why an AI solution was chosen, how it was designed and developed, on what grounds it was deployed, how it's monitored and updated, and the conditions under which it may be retired. There are four specific effects of building in transparency: 1) it decreases the risk of error and misuse, 2) it distributes responsibility, 3) it enables internal and external oversight, and 4) it expresses respect for people. In 2018, one of the largest tech companies in the world premiered an AI that called restaurants and impersonated a human to make reservations. To "prove" it was human, the company trained the AI to insert "umms" and "ahhs" into its request: for instance, "When would I like the reservation? If the product team doesn't explain how to properly handle the outputs of the model, introducing AI can be counterproductive in high-stakes situations. In designing the model, the data scientists reasonably thought that erroneously marking an x-ray as negative when in fact, the x-ray does show a cancerous tumor can have very dangerous consequences and so they set a low tolerance for false negatives and, thus, a high tolerance for false positives. Had they been properly informed -- had the design decision been made transparent to the end-user -- the radiologists may have thought, I really don't see anything here and I know the AI is overly sensitive, so I'm going to move on. By being transparent from start to finish, genuine accountability can be distributed among all as they are given the knowledge they need to make responsible decisions. Consider, for instance, a financial advisor who hides the existence of some investment opportunities and emphasizes the potential upsides of others because he gets a larger commission when he sells the latter. The more general point is that AI can undermine people's autonomy -- their ability to see the options available to them and to choose among them without undue influence or manipulation. 
A Comprehensive Survey on the Cyber-Security of Smart Grids: Cyber-Attacks, Detection, Countermeasure Techniques, and Future Directions
Khoei, Tala Talaei, Slimane, Hadjar Ould, Kaabouch, Naima
One of the significant challenges that smart grid networks face is cyber-security. Several studies have been conducted to highlight those security challenges. However, the majority of these surveys classify attacks based on the security requirements, confidentiality, integrity, and availability, without taking into consideration the accountability requirement. In addition, some of these surveys focused on the Transmission Control Protocol/Internet Protocol (TCP/IP) model, which does not differentiate between the application, session, and presentation and the data link and physical layers of the Open System Interconnection (OSI) model. In this survey paper, we provide a classification of attacks based on the OSI model and discuss in more detail the cyber-attacks that can target the different layers of smart grid networks communication. We also propose new classifications for the detection and countermeasure techniques and describe existing techniques under each category. Finally, we discuss challenges and future research directions.
Sharp Constants in Uniformity Testing via the Huber Statistic
Uniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\epsilon$-far distribution with $1-\delta$ probability is $n = \Theta\left(\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$. We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2}$ in the regime where this term is dominant, unlike all other existing testers.
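As a rough illustration of the collision-based statistic mentioned above, the sketch below counts colliding sample pairs and compares the collision rate with a threshold. The function names and the threshold $(1 + 2\epsilon^2)/m$ are assumptions: the threshold is simply the midpoint between the collision probability $1/m$ of the uniform distribution and the lower bound $(1 + 4\epsilon^2)/m$ for any distribution that is $\epsilon$-far in TV distance. It is not the sharp constant analysed in the paper, and the Huber-based tester is not shown.

```python
import random
from collections import Counter


def collision_statistic(samples):
    """Fraction of unordered sample pairs that land on the same element."""
    n = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) / 2)


def looks_uniform(samples, m, eps):
    """Accept uniformity if the collision rate is below an assumed threshold.

    The threshold (1 + 2*eps^2)/m is an illustrative midpoint, not a tuned
    constant from the paper.
    """
    return collision_statistic(samples) <= (1 + 2 * eps ** 2) / m


# Small simulation with hypothetical parameters.
rng = random.Random(0)
m, n, eps = 100, 2000, 0.5
uniform_samples = [rng.randrange(m) for _ in range(n)]
# A distribution at TV distance 0.5 from uniform: all mass on half the domain.
far_samples = [rng.randrange(m // 2) for _ in range(n)]

print(looks_uniform(uniform_samples, m, eps))
print(looks_uniform(far_samples, m, eps))
```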
PAC-Wrap: Semi-Supervised PAC Anomaly Detection
Li, Shuo, Ji, Xiayan, Dobriban, Edgar, Sokolsky, Oleg, Lee, Insup
Anomaly detection is essential for preventing hazardous outcomes for safety-critical applications like autonomous driving. Given their safety-criticality, these applications benefit from provable bounds on various errors in anomaly detection. To achieve this goal in the semi-supervised setting, we propose to provide Probably Approximately Correct (PAC) guarantees on the false negative and false positive detection rates for anomaly detection algorithms. Our method (PAC-Wrap) can wrap around virtually any existing semi-supervised and unsupervised anomaly detection method, endowing it with rigorous guarantees. Our experiments with various anomaly detectors and datasets indicate that PAC-Wrap is broadly effective.
Machine Learning on a Large Scale
The ROC curve is also used to compute the area under the ROC curve metric. The ROC curve of a perfect model will approach the top-left corner, whilst that of a random model will approach the diagonal (true positive rate = false positive rate). The area under the ROC curve ranges between 0 and 1 and can be computed via a BinaryClassificationEvaluator object. The result is impressive, despite the attempt to hamper the model quality. The area under the ROC curve for the training set can be obtained from the model summary, lr_model.summary.areaUnderROC. The BinaryClassificationEvaluator object can also be used to compute the area under the PR curve.
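To make the metric itself concrete, here is a small sketch that computes the area under the ROC curve directly in plain Python rather than through Spark's BinaryClassificationEvaluator. It uses the pairwise-ranking interpretation: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the example scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 0/1 class labels and `scores` the predicted scores.
    This is quadratic in the number of examples, so it is only a teaching aid.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0
print(roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

A model that scores every example identically gets 0.5, matching the diagonal of the ROC plot, while a model that ranks every positive above every negative gets 1.0.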
Python for Machine Learning: A Tutorial
Python has become the most popular data science and machine learning programming language. But in order to obtain effective data and results, it's important that you have a basic understanding of how it works with machine learning. In this introductory tutorial, you'll learn the basics of Python for machine learning, including different model types and the steps to take to ensure you obtain quality data, using a sample machine learning problem. In addition, you'll get to know some of the most popular libraries and tools for machine learning. Machine learning (ML) is a form of artificial intelligence (AI) that teaches computers to make predictions and recommendations and solve problems based on data. Its problem-solving capabilities make it a useful tool in industries such as financial services, healthcare, marketing and sales, and education, among others. There are three main types of machine learning: supervised, unsupervised, and reinforcement. In supervised learning, the computer is given a set of training data that includes both the input data (what we predict from) and the output data (what we want to predict).
Particle Transformer for Jet Tagging
Qu, Huilin, Li, Congqiao, Qian, Sitian
Jet tagging is a critical yet challenging classification task in particle physics. While deep learning has transformed jet tagging and significantly improved performance, the lack of a large-scale public dataset impedes further enhancement. In this work, we present JetClass, a new comprehensive dataset for jet tagging. The JetClass dataset consists of 100 M jets, about two orders of magnitude larger than existing public datasets. A total of 10 types of jets are simulated, including several types unexplored for tagging so far. Based on the large dataset, we propose a new Transformer-based architecture for jet tagging, called Particle Transformer (ParT). By incorporating pairwise particle interactions in the attention mechanism, ParT achieves higher tagging performance than a plain Transformer and surpasses the previous state-of-the-art, ParticleNet, by a large margin. The pre-trained ParT models, once fine-tuned, also substantially enhance the performance on two widely adopted jet tagging benchmarks. The dataset, code and models are publicly available at https://github.com/jet-universe/particle_transformer.
From Modeling to Scoring: Correcting Predicted Class Probabilities in Imbalanced Datasets
Model evaluation is an important part of a data science project and it's exactly this part that quantifies how good your model is, how much it has improved from the previous version, how much better it is than your colleague's model, and how much room for improvement there still is. It is not unusual in machine learning applications to deal with imbalanced datasets, in areas such as fraud detection, computer network intrusion, medical diagnostics, and many more. Data imbalance refers to an unequal distribution of classes within a dataset, namely that there are far fewer events in one class than in the others. If, for example, we have a credit card fraud detection dataset, most of the transactions are not fraudulent and very few are classed as fraud. This underrepresented class is called the minority class, and by convention, the positive class.
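A short sketch shows why evaluation needs care when the positive class is rare. The numbers are invented for illustration (10 fraudulent transactions out of 1,000), and the helper names are hypothetical. The correction method the article goes on to describe is not reproduced here.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted class matches the true class."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recall(labels, predictions, positive=1):
    """Fraction of true positives that the model actually flags."""
    flagged = sum(1 for p, y in zip(predictions, labels)
                  if y == positive and p == positive)
    total_positive = sum(1 for y in labels if y == positive)
    return flagged / total_positive


# Invented dataset: 10 fraud cases (label 1) among 1,000 transactions.
labels = [1] * 10 + [0] * 990
# A trivial "model" that never predicts the minority (positive) class.
predictions = [0] * len(labels)

print(accuracy(labels, predictions))  # 0.99
print(recall(labels, predictions))    # 0.0
```

The trivial model is right 99% of the time yet detects no fraud at all, so accuracy alone says little about performance on the minority class.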
Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties
Durdy, Samantha, Gaultois, Michael, Gusev, Vladimir, Bollegala, Danushka, Rosseinsky, Matthew J.
With machine learning being a popular topic in current computational materials science literature, creating representations for compounds has become commonplace. These representations are rarely compared, as evaluating their performance - and the performance of the algorithms that they are used with - is non-trivial. With many materials datasets containing bias and skew caused by the research process, leave one cluster out cross validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. This raises the question of the impact, and control, of the range of cluster sizes on the LOCO-CV measurement outcomes. We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to better separate data to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception. We also find that the radial basis function improves the linear separability of chemical datasets in all 10 datasets tested and provide a framework for the application of this function in the LOCO-CV process to improve the outcome of LOCO-CV measurements regardless of machine learning algorithm, choice of metric, and choice of compound representation. We recommend kernelised LOCO-CV as a training paradigm for those looking to measure the extrapolatory power of an algorithm on materials data.
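The splitting logic behind leave one cluster out cross validation can be sketched as follows. This assumes cluster labels have already been assigned to each sample (for example by a clustering algorithm); the function name `loco_splits` and the toy labels are hypothetical, and the kernelised step described in the abstract is not shown.

```python
from collections import defaultdict


def loco_splits(cluster_ids):
    """Yield (held_out_cluster, train_indices, test_indices) for each cluster.

    Each cluster is held out once as the test set while all remaining
    samples form the training set.
    """
    by_cluster = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)
    for cluster in sorted(by_cluster):
        test = by_cluster[cluster]
        train = [i for i, c in enumerate(cluster_ids) if c != cluster]
        yield cluster, train, test


# Toy cluster assignment for five samples (made up for illustration).
cluster_ids = [0, 1, 0, 2, 1]
for held_out, train_idx, test_idx in loco_splits(cluster_ids):
    print(held_out, train_idx, test_idx)
```

Because a whole cluster is withheld at once, the test samples differ systematically from the training samples. This is what lets the procedure probe extrapolation rather than interpolation. Uneven cluster sizes translate directly into uneven train/test sizes, the issue the abstract raises.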