"Isolation Forest": The Anomaly Detection Algorithm Any Data Scientist Should Know


"Isolation Forest" is a brilliant algorithm for anomaly detection born in 2009 (here is the original paper). It has since become very popular: it is also implemented in Scikit-learn (see the documentation). In this article, we will appreciate the beauty in the intuition behind this algorithm and understand how exactly it works under the hood, with the aid of some examples. Anomaly (or outlier) detection is the task of identifying data points that are "very strange" compared to the majority of observations. This is useful in a range of applications, from fault detection to the discovery of financial fraud, from finding health issues to identifying unsatisfied customers. Moreover, it can also be beneficial for machine learning pipelines, since removing outliers can lead to an increase in model accuracy.
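The intuition can be sketched in a few lines of plain Python. This is a simplified illustration, not the Scikit-learn implementation: it repeatedly picks a random feature and a random split value, keeps only the side of the split containing the point of interest, and counts how many splits it takes to isolate that point. The helper names (`path_length`, `mean_path_length`), the toy data, and the number of trials are all assumptions made for the example; the published algorithm also subsamples the data and normalizes the depth into a score, which is omitted here.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))           # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                              # no split possible on this feature
        return depth
    split = rng.uniform(lo, hi)               # pick a random split value
    # Keep only the side of the split that contains `point`.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trials=200, seed=0):
    """Average isolation depth of `point` over many random split sequences."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trials))
    return total / n_trials


# Toy data (hypothetical): a 2-D Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

# Use the cluster point closest to the origin as a "typical" observation.
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)

outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length(inlier, data)
print("mean splits to isolate the outlier:", outlier_depth)
print("mean splits to isolate the inlier: ", inlier_depth)
```

The far-away point is usually cut off after only a few splits, while a point in the middle of the cluster needs many more; a short average isolation depth is what marks a point as anomalous.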

Big Data Exchange enters Indonesian data centre market with joint venture deal


Eileen Yu began covering the IT industry when Asynchronous Transfer Mode was still hip and e-commerce was the new buzzword. Currently an independent business technology journalist and content specialist based in Singapore, she has over 20 years of industry experience with various publications including ZDNet, IDG, and Singapore Press Holdings. Big Data Exchange (BDx) has marked its entry into Indonesia's data centre market through a joint venture agreement with PT Indosat and the latter's two subsidiaries. The move aims to tap increasing demand for cloud services and connectivity. Estimated to be worth $300 million, the deal would see BDx enter a conditional sale and purchase agreement of shares (CSPA) and establish a joint venture with PT Indosat, PT Aplikanusa Lintasarta, and PT Starone Mitra Telekomunikasi (SMT). Under the agreement, BDx, Indosat, and Lintasarta would set up data centre and cloud operations in the Asian market, BDx said in a statement Thursday.

Data Analysis Method: Mathematics Optimization to Build Decision Making


Optimization is the problem of finding the best decision, one that is both effective and efficient, whether that means maximizing or minimizing some quantity, by determining a satisfactory solution. Optimization is not a new science. It has been developing at least since the 17th century, when Newton discovered how to compute roots. The science of optimization is still evolving today, in both techniques and applications. Many problems in everyday life involve optimization in their solution.

Kaggle - Get The Best Data Science, Machine Learning Profile


Welcome to the "Kaggle - Get Best Profile in Data Science & Machine Learning" course. Kaggle is a machine learning and data science community. Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. Machine learning is constantly being applied to new industries and new problems. Whether you're a marketer, video game designer, or programmer, Oak Academy has a course to help you apply machine learning to your work. It's hard to imagine our lives without machine learning.

17 top business process management tools for 2022


Business process management is now a mature discipline. It has formal approaches, methods, techniques and a rich set of concepts. It has also evolved to the point where it is applied to projects of all sizes and supports both business process improvement and business transformation. As BPM evolved, so did the enterprise's business processes. They became too large and complex to be managed without automated tool support.

GitHub - ml-tooling/best-of-ml-python: 🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.


A ranked list of awesome machine learning Python libraries. This curated list contains 920 awesome open-source projects with a total of 3.4M stars grouped into 34 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you would like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Discover other best-of lists or create your own.

Graph data science: What you need to know


Whether you're genuinely interested in getting insights and solving problems using data, or just attracted by what has been called "the most promising career" by LinkedIn and the "best job in America" by Glassdoor, chances are you're familiar with data science. As we've elaborated previously, graphs are a universal data structure with manifestations that span a wide spectrum: from analytics to databases, and from knowledge management to data science, machine learning and even hardware. Graph data science is when you want to answer questions, not just with your data, but with the connections between your data points -- that's the 30-second explanation, according to Alicia Frame. Frame is the senior director of product management for data science at Neo4j, a leading graph database vendor.

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

Communications of the ACM

Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplication operations that need to be parallelized across multiple nodes. The presence of straggling nodes--computing nodes that unpredictably slow down or fail--is a major bottleneck in such distributed computations. Ideal load balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds as well as the ability to quickly move data. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they ignore partial work done by straggling nodes, thus resulting in a lot of redundant computation. We propose a rateless fountain coding strategy that achieves the best of both worlds--we prove that its latency is asymptotically equal to ideal load balancing, and it performs asymptotically zero redundant computations. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively finished by the nodes. Evaluation on parallel and distributed computing yields as much as three times speedup over uncoded schemes. Matrix-vector multiplications form the core of a plethora of scientific computing and machine learning applications that include solving partial differential equations, forward and back propagation in neural networks, computing the PageRank of graphs, etcetera. In the age of Big Data, most of these applications involve multiplying extremely large matrices and vectors and the computations cannot be performed efficiently on a single machine. This has motivated the development of several algorithms that seek to speed up matrix-vector multiplication by distributing the computation across multiple computing nodes.
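The encode-then-decode idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' scheme: instead of a real fountain code it swaps in dense random Gaussian coefficients and plain Gauss-Jordan elimination, so here exactly m finished products are enough to decode, whereas the abstract speaks of slightly more than m. The matrix sizes, the simulated finish order, and the helper names (`dot`, `solve`) are assumptions made for the example.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def solve(M, y):
    """Solve the square system M b = y by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(row) + [rhs] for row, rhs in zip(M, y)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


rng = random.Random(0)
m, n = 5, 3                                   # A has m rows and n columns
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x = [rng.gauss(0, 1) for _ in range(n)]

# Encoding: each encoded row is a random linear combination of the m rows of A.
n_encoded = 2 * m
G = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n_encoded)]
A_enc = [
    [sum(G[i][j] * A[j][k] for j in range(m)) for k in range(n)]
    for i in range(n_encoded)
]

# Simulated workers: encoded row-vector products finish in an arbitrary order,
# and the decoder uses whichever m products happen to arrive first.
finish_order = list(range(n_encoded))
rng.shuffle(finish_order)
done = finish_order[:m]
products = [dot(A_enc[i], x) for i in done]

# Decoding: each finished product equals G[i] . (A x), so solve for A x.
decoded = solve([G[i] for i in done], products)
expected = [dot(row, x) for row in A]
print("max decoding error:", max(abs(d - e) for d, e in zip(decoded, expected)))
```

Because any subset of finished products can be used, slow or failed workers do not block the result, which is the property the abstract exploits; a practical fountain code would trade the dense coefficients used here for sparse ones that are cheaper to decode.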

Two Paths for Digital Disability Law

Communications of the ACM

People with disabilities often cannot count on modern digital devices, software, and services to be accessible. Will streaming video platforms include closed captions for viewers who are deaf or hard of hearing? How will virtual assistants work for users with speech disabilities? Can websites be read aloud by text-to-speech engines for readers who are blind or visually impaired? How will smartphones be accessed by people with physical and mobility disabilities?

ACM's 2022 General Election

Communications of the ACM

The ACM constitution provides that our Association hold a general election in the even-numbered years for the positions of President, Vice President, Secretary/Treasurer, and Members-at-Large. Biographical information and statements of the candidates appear on the following pages (candidates' names appear in random order). In addition to the election of ACM's officers--President, Vice President, Secretary/Treasurer--two Members-at-Large will be elected to serve on ACM Council. The 2022 candidates for ACM President, Yannis Ioannidis and Joseph A. Konstan, are working together to solicit and answer questions from the computing community! Please refer to the instructions posted at Please note the election email will be addressed from Please return your ballot in the enclosed envelope, which must be signed by you on the outside in the space provided. The signed ballot envelope may be inserted into a separate envelope for mailing if you prefer this method. All ballots must be received by no later than 16:00 UTC on 23 May 2022. Validation by the Elections Committee will take place at 14:00 UTC on 25 May 2022. Yannis Ioannidis is Professor of Informatics & Telecom at the U. of Athens, Greece (since 1997). Prior to that, he was a professor of Computer Sciences at the U. of Wisconsin-Madison (1986-1997).