
### Yandex Open-Sources YaLM Model With 100 Billion Parameters

Transformers are used for translation and text-summarization tasks because they can analyze sequential input data such as natural language. Using self-attention, a transformer weights the importance of each component of the input data differently. Large-scale transformer-based language models have recently gained a lot of popularity in computer vision and natural language processing (NLP). They frequently grow in size and complexity, yet constructing these models costs millions of dollars, requires the best experts, and takes years. As a result, many companies cannot use them, and only major IT organizations have access to this cutting-edge technology.

### A Unifying, Game-Theoretic Framework for Imitation Learning

From beating the world champion at Go (Silver et al.) to getting cars to drive themselves (Bojarski et al.), we've seen unprecedented successes in learning to make sequential decisions over the last few years. Viewed algorithmically, many of these accomplishments share a common paradigm: imitation learning (IL), in which one is given access to samples of expert behavior. IL algorithms can be grouped broadly into (a) online, (b) offline, and (c) interactive methods. For each setting, we provide performance bounds for learned policies that apply to all algorithms, provably efficient algorithmic templates for achieving said bounds, and practical realizations that outperform recent work.

### Data Drift - Types, causes and measures.

In most big-data analysis applications, data evolve over time and must be analyzed and treated in near real time. Patterns and interactions in such data often change, so models built to analyze them quickly become outdated. In machine learning and data mining, this phenomenon is referred to as data drift. Data drift is a change in the distribution of data over time; for a machine learning model, it is the change between the distribution of the baseline data set on which the model was trained and that of the current real-time production data.

### KL divergence, JS divergence, and Wasserstein metric in Deep Learning

In this blog, I will continue the discussion of essential probability and statistics concepts with three more that are widely used in deep learning to measure distances between probability distributions.

### Understanding Distance Metrics and Their Significance - DataScienceCentral.com

I saw a nice representation of distance metrics; this topic is not easy to explain. In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects: two random variables, two probability distributions or samples, or an individual sample point and a population (or a wider sample of points). The concept of statistical distance is also related to that of probabilistic metric spaces, where the distances between points are specified by probability distributions rather than numbers. Distances and similarities are measures that describe how close two statistical objects are.
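One concrete instance of the point-to-population case mentioned above is the Mahalanobis distance, which measures how far a point lies from a population's mean in units of the population's covariance. A small sketch on synthetic data (the numbers are made up for illustration):

```python
# Sketch: distance between an individual sample point and a population,
# using the Mahalanobis distance. The population here is synthetic.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(42)
population = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=10_000
)

mu = population.mean(axis=0)                           # population center
vi = np.linalg.inv(np.cov(population, rowvar=False))   # inverse covariance

point = np.array([3.0, 3.0])
d = mahalanobis(point, mu, vi)
print(f"Mahalanobis distance: {d:.2f}")
```

Unlike plain Euclidean distance, this measure shrinks along directions in which the population varies a lot and stretches along directions in which it is tight, which is why it is a *statistical* rather than purely geometric distance.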

### GANs #004 Variational Autoencoders – in-depth explained

Highlight: In this post, we will be discussing Variational Autoencoders (VAE). In order to fully understand the underlying ideas, we need a basic understanding of traditional autoencoders; luckily, we have already written about them in our previous posts. This post will cover several topics. First, we will review autoencoders. Then, we will review some basic probability concepts. Next, we will explain the Kullback-Leibler divergence. In addition, we will talk about the loss function and how it can be derived. So, the first concept to review is autoencoders, sometimes also called stacked autoencoders. One application of autoencoders is image compression. The pipeline is commonly presented as a block diagram: an input image goes into the encoder part. The input can be a simple image like one from the MNIST data set, say the digit $$3$$. Once this digit passes through the network, we want to reconstruct the original image at the output as closely as possible. For that, we use the cost function $$L$$, which has two sets of parameters, $$\theta$$ and $$\phi$$.
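To make the cost function $$L$$ concrete, the sketch below computes the two terms that appear in the standard VAE objective: a reconstruction term plus the closed-form KL divergence between the encoder's Gaussian $$q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2)$$ and the prior $$\mathcal{N}(0, I)$$. All shapes and values are toy examples, not the blog's actual implementation:

```python
# Toy numpy sketch of the two terms in the VAE loss:
# reconstruction (binary cross-entropy) + KL(q(z|x) || N(0, I)).
# Inputs are made up for illustration.
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: per-pixel binary cross-entropy, summed
    recon = -np.sum(
        x * np.log(x_recon + 1e-8) + (1 - x) * np.log(1 - x_recon + 1e-8)
    )
    # Closed-form KL(N(mu, sigma^2) || N(0, I))
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return recon + kl

x = np.array([0.0, 1.0, 1.0, 0.0])        # toy "image"
x_recon = np.array([0.1, 0.9, 0.8, 0.2])  # decoder output
mu = np.array([0.5, -0.3])                # encoder mean
log_var = np.array([-0.1, 0.2])           # encoder log-variance
print(vae_loss(x, x_recon, mu, log_var))
```

The reconstruction term rewards matching the input, while the KL term regularizes the latent code toward the prior; training trades the two off.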

### Auto Evaluate Feature Selection

I hope you're doing great! In statistics, a divergence is a function that establishes the "distance" between two probability distributions; in other words, it measures the difference between two distributions. If we interpret these two distributions as sets of observable values, we can measure the distance between them. The Bregman divergence is one of many such divergences.
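For context, a Bregman divergence is generated by a convex function $F$ as $D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle$, and different choices of $F$ recover familiar distances. A hedged sketch (the functions and vectors below are illustrative):

```python
# Sketch: Bregman divergence D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>.
# Two classic choices of F recover well-known distances.
import numpy as np

def bregman(p, q, F, grad_F):
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# F(x) = ||x||^2 generates the squared Euclidean distance
sq_norm = lambda x: np.dot(x, x)
grad_sq = lambda x: 2 * x

# F(x) = sum x_i log x_i (negative entropy) generates the KL divergence
neg_ent = lambda x: np.sum(x * np.log(x))
grad_ne = lambda x: np.log(x) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(bregman(p, q, sq_norm, grad_sq))  # equals ||p - q||^2
print(bregman(p, q, neg_ent, grad_ne))  # equals KL(p || q) for distributions
```

This is why the Bregman family is a useful unifying lens: squared Euclidean distance, KL divergence, and others are all the same construction with different generating functions.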

### Machine learning highly effective at identifying SARS-CoV-2 variants

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causal agent of the coronavirus disease 2019 (COVID-19) pandemic, is a highly pathogenic coronavirus belonging to the betacoronavirus genus. The genome of SARS-CoV-2 consists of a single-stranded RNA of 29,903 nucleotides. SARS-CoV-2 is associated with a very high mutation rate, and machine learning has recently proved to be a valuable method for identifying distinctive genomic signatures among viral sequences. This could be helpful in taxonomic and phylogenetic studies and in detecting emerging variants of concern. In a new study posted to the bioRxiv* preprint server, researchers evaluated KEVOLVE, an approach based on a genetic algorithm with a machine learning kernel, to identify genomic signatures.

### Generalization Bounds via Convex Analysis

Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
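For reference, the standard bound of Xu and Raginsky (2017) that this work generalizes is commonly stated as follows (for a training sample $S$ of $n$ i.i.d. points, learned hypothesis $W$, and a loss that is $\sigma$-subgaussian under the data distribution):

```latex
% Mutual-information generalization bound (Xu & Raginsky, 2017),
% as commonly stated for sigma-subgaussian losses:
\[
  \bigl|\mathbb{E}\,\mathrm{gen}(W, S)\bigr|
  \;\le\;
  \sqrt{\frac{2\sigma^{2}}{n}\, I(S; W)},
\]
```

The paper's contribution, per the abstract, is to replace $I(S;W)$ with any strongly convex function of the joint input-output distribution, with the subgaussianity condition replaced by a norm bound matched to that dependence measure.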

### Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT

Nowadays, the industrial Internet of Things (IIoT) plays an integral role in Industry 4.0 and produces massive amounts of data for industrial intelligence. These data reside on decentralized devices in modern factories. To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models. However, the local data collected by different devices are skewed in class distribution, which degrades industrial FL performance. This challenge has been widely studied at the mobile edge, but existing approaches ignore the rapidly changing streaming data and the clustered nature of factory devices, and, more seriously, they may threaten data security. In this paper, we propose FedGS, a hierarchical cloud-edge-end FL framework for 5G-empowered industries, to improve industrial FL performance on non-i.i.d. data. Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes that participate in FL training. We then propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity. The proposed methods are time-efficient, can adapt to dynamic environments, and do not expose confidential industrial data to risky manipulation. We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient. Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, confirming its superior effectiveness and efficiency on non-i.i.d. data.