Statistical Learning

Top 10 Must-Know Machine Learning Algorithms for Data Scientists – Part 1 - KDnuggets


Wading through the vast array of information for data science newcomers on machine learning algorithms can be a difficult and time-consuming process. Figuring out which algorithms are widely used and which are simply novel or interesting is not just an academic exercise; determining where to concentrate your time and focus during the early days of study can determine the difference between getting your career off to a quick start and experiencing an extended ground delay. How, exactly, does one discriminate between immediately useful algorithms worthy of attention and, well, not so much? Determining how to come up with an objective list of machine learning algorithms of authority is inherently difficult, but it seems that going directly to practitioners for their feedback may be the optimal approach. Such a process presents a whole host of difficulties of its own, as could be easily imagined, and results of such surveys are few and far between.

Build XGBoost models with Amazon Redshift ML


Amazon Redshift ML allows data analysts, developers, and data scientists to train machine learning (ML) models using SQL. In previous posts, we demonstrated how customers can use the automatic model training capability of Amazon Redshift to train their classification and regression models. Redshift ML provides several capabilities for data scientists. It allows you to create a model using SQL and specify your algorithm as XGBoost. It also lets you bring your pre-trained XGBoost model into Amazon Redshift for local inference.

Kruskal Wallis test in R-One-way ANOVA Alternative


The Kruskal-Wallis test is one of the most frequently used methods in nonparametric statistics for analyzing data in a one-way classification. It is the nonparametric counterpart of the one-way analysis of variance used in parametric methods. It tests whether the k populations from which the independent samples have been drawn are identical. There is no restriction on sample sizes. The Kruskal-Wallis test is mainly based on the following assumptions.
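The article works in R, but the statistic itself is easy to sketch in plain Python. The block below is a minimal illustration, not the article's code: it pools the samples, assigns average ranks to ties, and computes H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1). The function names `average_ranks` and `kruskal_wallis_h` are hypothetical, and the usual tie correction is deliberately left out.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Compute the Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    weighted_rank_sums = 0.0
    offset = 0
    for group in groups:
        # Ranks are stored in pooled order, so each group is a contiguous slice.
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        offset += len(group)

    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)


if __name__ == "__main__":
    h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
    print(f"H = {h:.3f}")
```

Under the null hypothesis, H is approximately chi-square distributed with k - 1 degrees of freedom, so a p-value would come from comparing H against that distribution.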

Timescale scales out and sets its sights on analytics


Fresh off a recently announced $40 million B round of funding, Timescale is diversifying its TimescaleDB platform with a couple of goals: making it more scalable and adding a new analytics engine. As we noted when we discussed the general release of Amazon Timestream last fall, time series platforms are an old-but-suddenly-new category in the database landscape. Although IoT is often cited (or blamed) for the uptick in time series database activity, there are numerous scenarios (e.g., in capital markets, transportation, and logistics) where time is the defining parameter. But let's get this confession off our chests right now: TimescaleDB is a brand name that is easily confused with Amazon Timestream (OK, Timescale came out in the market first). As a result, we very often find ourselves tripping over all this nearly identical branding and entering global replace mode to make sure we put the right names in the right sentences.

Complete Machine Learning & Data Science with Python


Machine learning is constantly being applied to new industries and new problems. Whether you're a marketer, video game designer, or programmer, my course on Udemy is here to help you apply machine learning to your work. Welcome to the "Complete Machine Learning & Data Science with Python A-Z" course. Did you know that data science needs will create 11.5 million job openings by 2026? Did you know that the average salary for data science careers is $100,000?

Top 10 Data Science Projects for Beginners - KDnuggets


As an aspiring data scientist, you must have heard the advice "do data science projects" over a thousand times. Not only are data science projects a great learning experience, they also help you stand out from the crowd of data science enthusiasts looking to break into the field. In this article, I am going to walk you through the projects that are must-haves on your resume. I will also provide you with sample datasets to experiment with for each project, along with associated tutorials that will help you complete the project. Data collection and pre-processing is one of the most important skills to have as a data scientist.

Top 5 Data Science Projects with Source Code to kick-start your Career - DataFlair


Are you a Data Science aspirant looking for challenging, real-time Data Science projects? Then you are at the right place to gain mastery in the field of Data Science. In this article, we will discuss the best Data Science projects that will boost your knowledge, your skills, and your Data Science career too! These real-world Data Science projects with source code offer you a propitious way to gain hands-on experience and start your journey toward your dream Data Science job. Now let's quickly jump to our best Data Science project examples with source code.

Out-and-Out in Artificial Neural Networks with Keras


When I started reading articles on neural networks, I struggled to understand the basics behind neural networks and how they work. I read more and more articles on the internet, collected the key points, and put them together into private notes. Then I thought I would publish them to help others understand the topic better. It is fun to know the basics of any domain. The perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt.
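To make the perceptron concrete, here is a minimal sketch of the classic perceptron learning rule with a step activation, written in plain Python rather than Keras. It is an illustration under stated assumptions, not the article's code: the function names `train_perceptron` and `predict`, the learning rate, and the AND-gate data are all chosen here for the example.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum plus bias is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Train a single perceptron with the classic error-driven update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            # Weights move only when the prediction is wrong (error is -1 or +1).
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


if __name__ == "__main__":
    # Logical AND is linearly separable, so the rule converges on it.
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    targets = [0, 0, 0, 1]
    w, b = train_perceptron(inputs, targets)
    for x in inputs:
        print(x, predict(w, b, x))
```

A single perceptron can only learn linearly separable functions, which is why AND works here while XOR would not.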

Simple Linear Regression: A layman's explanation


Machine learning and statistics have many applications in business and the social sciences. However, the theory is often intimidating and not easily understood. In this series of articles, I aim to demystify the concepts behind the common tools used in data science and machine learning, starting with linear regression. Linear regression is a statistical method that allows us to describe relationships between variables (distinct things that can be measured or recorded, such as height, weight, and hair colour). It is a special case of the General Linear Model, a framework to describe how a variable of interest can be modelled using other predictor variables. In simple linear regression (SLR), we focus on the relationship between two continuous variables, x and y (hence, simple).
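As a small illustration of SLR, the sketch below fits the line y = intercept + slope * x by ordinary least squares, using the closed-form estimates slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The function name `fit_simple_linear_regression` and the sample data are assumptions made for this example.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # S_xy: how x and y vary together; S_xx: how x varies on its own.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must contain at least two distinct values")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = 1 + 2x.
    slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
    print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The slope is the expected change in y for a one-unit increase in x, and the intercept is the predicted y when x is zero.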

Communication Algorithm-Architecture Co-Design for Distributed Deep Learning


Abstract--Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency in widely used all-reduce algorithms and the opportunity for algorithm-architecture co-design. We propose the MULTITREE all-reduce algorithm with topology and resource utilization awareness for efficient and scalable all-reduce operations, which is applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communications, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of big gradient exchange. We evaluate the co-design using different all-reduce data sizes for synthetic study, demonstrating its effectiveness on various interconnection network topologies, in addition to state-of-the-art deep neural networks for real workload experiments. The results show that MULTITREE achieves 2.3× and 1.56× communication speedup, as well as up to 81% and 30% training time reduction compared to ring all-reduce and state-of-the-art approaches, respectively.
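For readers unfamiliar with the baseline, the sketch below simulates ring all-reduce, the algorithm the abstract compares against. It is not MULTITREE, whose details are not given here. This is a single-process illustration under simplifying assumptions: the function name `ring_all_reduce` is hypothetical, the reduction is a sum, and each worker's vector length must be divisible by the number of workers.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum all-reduce over a logical ring of workers.

    worker_data holds one equally sized list of numbers per worker.
    Returns the per-worker buffers, each containing the element-wise sum.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(vec) != length for vec in worker_data):
        raise ValueError("all workers must hold vectors of the same length")
    if length % n != 0:
        raise ValueError("vector length must be divisible by the worker count")
    chunk = length // n
    buffers = [list(vec) for vec in worker_data]

    def span(c):
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: at each step every worker sends one chunk to
    # its right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot the payloads first so that all sends happen "simultaneously".
        sends = []
        for rank in range(n):
            c = (rank - step) % n
            sends.append((c, [buffers[rank][i] for i in span(c)]))
        for rank, (c, payload) in enumerate(sends):
            dest = (rank + 1) % n
            for offset, i in enumerate(span(c)):
                buffers[dest][i] += payload[offset]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: circulate the finished chunks around the ring,
    # overwriting stale partial sums.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            sends.append((c, [buffers[rank][i] for i in span(c)]))
        for rank, (c, payload) in enumerate(sends):
            dest = (rank + 1) % n
            for offset, i in enumerate(span(c)):
                buffers[dest][i] = payload[offset]

    return buffers


if __name__ == "__main__":
    gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for rank, buf in enumerate(ring_all_reduce(gradients)):
        print(rank, buf)
```

Each worker sends 2(n - 1) chunks in total, and the 2(n - 1) steps run one after another, so the number of communication steps grows linearly with the number of workers.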