to

Hardness of Learning Halfspaces with Massart Noise

We study the complexity of PAC learning halfspaces in the presence of Massart (bounded) noise. Specifically, given labeled examples $(x, y)$ from a distribution $D$ on $\mathbb{R}^{n} \times \{ \pm 1\}$ such that the marginal distribution on $x$ is arbitrary and the labels are generated by an unknown halfspace corrupted with Massart noise at rate $\eta<1/2$, we want to compute a hypothesis with small misclassification error. Characterizing the efficient learnability of halfspaces in the Massart model has remained a longstanding open problem in learning theory. Recent work gave a polynomial-time learning algorithm for this problem with error $\eta+\epsilon$. This error upper bound can be far from the information-theoretically optimal bound of $\mathrm{OPT}+\epsilon$. More recent work showed that {\em exact learning}, i.e., achieving error $\mathrm{OPT}+\epsilon$, is hard in the Statistical Query (SQ) model. In this work, we show that there is an exponential gap between the information-theoretically optimal error and the best error that can be achieved by a polynomial-time SQ algorithm. In particular, our lower bound implies that no efficient SQ algorithm can approximate the optimal error within any polynomial factor.

Notes on Deep Learning Theory

These are the notes for the lectures that I was giving during Fall 2020 at the Moscow Institute of Physics and Technology (MIPT) and at the Yandex School of Data Analysis (YSDA). The notes cover some aspects of initialization, loss landscape, generalization, and a neural tangent kernel theory. While many other topics (e.g. expressivity, a mean-field theory, a double descent phenomenon) are missing in the current version, we plan to add them in future revisions.

On Irrelevant Literals in Pseudo-Boolean Constraint Learning

Learning pseudo-Boolean (PB) constraints in PB solvers exploiting cutting planes based inference is not as well understood as clause learning in conflict-driven clause learning solvers. In this paper, we show that PB constraints derived using cutting planes may contain \emph{irrelevant literals}, i.e., literals whose assigned values (whatever they are) never change the truth value of the constraint. Such literals may lead to infer constraints that are weaker than they should be, impacting the size of the proof built by the solver, and thus also affecting its performance. This suggests that current implementations of PB solvers based on cutting planes should be reconsidered to prevent the generation of irrelevant literals. Indeed, detecting and removing irrelevant literals is too expensive in practice to be considered as an option (the associated problem is NP-hard.

Mapping Network States Using Connectivity Queries

Can we infer all the failed components of an infrastructure network, given a sample of reachable nodes from supply nodes? One of the most critical post-disruption processes after a natural disaster is to quickly determine the damage or failure states of critical infrastructure components. However, this is non-trivial, considering that often only a fraction of components may be accessible or observable after a disruptive event. Past work has looked into inferring failed components given point probes, i.e. with a direct sample of failed components. In contrast, we study the harder problem of inferring failed components given partial information of some `serviceable' reachable nodes and a small sample of point probes, being the first often more practical to obtain. We formulate this novel problem using the Minimum Description Length (MDL) principle, and then present a greedy algorithm that minimizes MDL cost effectively. We evaluate our algorithm on domain-expert simulations of real networks in the aftermath of an earthquake. Our algorithm successfully identify failed components, especially the critical ones affecting the overall system performance.

Data Science & Machine Learning(Theory+Projects)A-Z 90 HOURS

Electrification was, without a doubt, the greatest engineering marvel of the 20th century. The electric motor was invented way back in 1821, and the electrical circuit was mathematically analyzed in 1827. But factory electrification, household electrification, and railway electrification all started slowly several decades later. The field of AI was formally founded in 1956. But it's only now--more than six decades later--that AI is expected to revolutionize the way humanity will live and work in the coming decades.

GpuShareSat: a SAT solver using the GPU for clause sharing

We describe a SAT solver using both the GPU (CUDA) and the CPU with a new clause exchange strategy. The CPU runs a classic multithreaded CDCL SAT solver. EachCPU thread exports all the clauses it learns to the GPU. The GPU makes a heavy usage of bitwise operations. It notices when a clause would have been used by a CPU thread and notifies that thread, in which case it imports that clause. This relies on the GPU repeatedly testing millions of clauses against hundreds of assignments. All the clauses are tested independantly from each other (which allows the GPU massively parallel approach), but against all the assignments at once, using bitwise operations. This allows CPU threads to only import clauses which would have been useful for them. Our solver is based upon glucose-syrup. Experiments show that this leads to a strong performance improvement, with 22 more instances solved on the SAT 2020 competition than glucose-syrup.

Detecting Hierarchical Changes in Latent Variable Models

This paper addresses the issue of detecting hierarchical changes in latent variable models (HCDL) from data streams. There are three different levels of changes for latent variable models: 1) the first level is the change in data distribution for fixed latent variables, 2) the second one is that in the distribution over latent variables, and 3) the third one is that in the number of latent variables. It is important to detect these changes because we can analyze the causes of changes by identifying which level a change comes from (change interpretability). This paper proposes an information-theoretic framework for detecting changes of the three levels in a hierarchical way. The key idea to realize it is to employ the MDL (minimum description length) change statistics for measuring the degree of change, in combination with DNML (decomposed normalized maximum likelihood) code-length calculation. We give a theoretical basis for making reliable alarms for changes. Focusing on stochastic block models, we employ synthetic and benchmark datasets to empirically demonstrate the effectiveness of our framework in terms of change interpretability as well as change detection.

You can't eliminate bias from machine learning, but you can pick your bias

Bias is a major topic of concern in mainstream society, which has embraced the concept that certain characteristics -- race, gender, age, or zip code, for example -- should not matter when making decisions about things such as credit or insurance. But while an absence of bias makes sense on a human level, in the world of machine learning, it's a bit different. In machine learning theory, if you can mathematically prove you don't have any bias and if you find the optimal model, the value of the model actually diminishes because you will not be able to make generalizations. What this tells us is that, as unfortunate as it may sound, without any bias built into the model, you cannot learn. Modern businesses want to use machine learning and data mining to make decisions based on what their data tells them, but the very nature of that inquiry is discriminatory.

Noise in Classification

Machine learning studies automatic methods for making accurate predictions and useful decisions based on previous observations and experience. From the application point of view, machine learning has become a successful discipline for operating in complex domains such as natural language processing, speech recognition, and computer vision. Moreover, the theoretical foundations of machine learning have led to the development of powerful and versatile techniques, which are routinely used in a wide range of commercial systems in today's world. However, a major challenge of increasing importance in the theory and practice of machine learning is to provide algorithms that are robust to adversarial noise. In this chapter, we focus on classification where the goal is to learn a classification rule from labeled examples only.

Stable Sample Compression Schemes: New Applications and an Optimal SVM Margin Bound

We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular, we prove a new margin bound for SVM, removing a log factor. The new bound is provably optimal.