

What is Machine Learning? A Primer for the Epidemiologist


Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. To critically evaluate the value of integrating machine learning algorithms with existing methods, however, it is essential to address the language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in the machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.

Machine learning is a branch of computer science that broadly aims to enable computers to "learn" without being directly programmed (1). It has origins in the artificial intelligence movement of the 1950s and emphasizes practical objectives and applications, particularly prediction and optimization. Computers "learn" in machine learning by improving their performance at tasks through "experience" (2, p. xv). In practice, "experience" usually means fitting to data; hence, there is not a clear boundary between machine learning and statistical approaches. Indeed, whether a given methodology is considered "machine learning" or "statistical" often reflects its history as much as genuine differences, and many algorithms (e.g., the least absolute shrinkage and selection operator (LASSO), stepwise regression) may or may not be considered machine learning depending on whom you ask. Still, despite methodological similarities, machine learning is philosophically and practically distinguishable. At the risk of (considerable) oversimplification, machine learning generally emphasizes predictive accuracy over hypothesis-driven inference, usually focusing on large, high-dimensional (i.e., having many covariates) data sets (3, 4). Regardless of the precise distinction between approaches, in practice machine learning offers epidemiologists important tools. In particular, a growing focus on "Big Data" emphasizes problems and data sets for which machine learning algorithms excel while more commonly used statistical approaches struggle. This primer provides a basic introduction to machine learning, with the aim of giving readers a foundation for critically reading studies based on these methods and a jumping-off point for those interested in using machine learning techniques in epidemiologic research.
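The LASSO mentioned above illustrates the prediction-first mindset well. Here is a minimal sketch (ours, using scikit-learn, not code from the primer) on an invented high-dimensional data set with more covariates than observations: the model is judged purely by held-out predictive accuracy rather than by hypothesis tests on coefficients.

```python
# Fit a LASSO on synthetic high-dimensional data and evaluate it by
# out-of-sample prediction, the typical machine learning yardstick.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 200, 500                       # more covariates than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                        # only 5 covariates truly matter
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = Lasso(alpha=0.1).fit(X_tr, y_tr)

print("held-out R^2:", round(model.score(X_te, y_te), 2))
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

The L1 penalty shrinks most of the 500 coefficients to exactly zero, which is why LASSO remains usable when classical least squares (with p > n) is not.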

Estimating epidemiologic dynamics from cross-sectional viral load distributions


During the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, polymerase chain reaction (PCR) tests were generally reported only as binary positive or negative outcomes. However, these test results contain a great deal more information than that. As viral load declines exponentially, the PCR cycle threshold (Ct) increases linearly. Hay et al. developed an approach for extracting epidemiological information from the Ct values obtained from PCR tests used in surveillance in a variety of settings (see the Perspective by Lopman and McQuade). Although there are challenges to relying on single Ct values for individual-level decision-making, even a limited aggregation of data from a population can inform on the trajectory of the pandemic. Therefore, across a population, an increase in aggregated Ct values indicates that a decline in cases is occurring.

Science, this issue p. eabh0635 (doi:10.1126/science.abh0635); see also p. 280 (doi:10.1126/science.abj4185).

### INTRODUCTION

Current approaches to epidemic monitoring rely on case counts, test positivity rates, and reported deaths or hospitalizations. These metrics, however, provide a limited and often biased picture as a result of testing constraints, unrepresentative sampling, and reporting delays. Random cross-sectional virologic surveys can overcome some of these biases by providing snapshots of infection prevalence but currently offer little information on the epidemic trajectory without sampling across multiple time points.

### RATIONALE

We develop a new method that uses information inherent in cycle threshold (Ct) values from reverse transcription quantitative polymerase chain reaction (RT-qPCR) tests to robustly estimate the epidemic trajectory from multiple or even a single cross section of positive samples. Ct values are related to viral loads, which depend on the time since infection; Ct values are generally lower when the time between infection and sample collection is short. Despite variation across individuals, samples, and testing platforms, Ct values provide a probabilistic measure of time since infection. We find that the distribution of Ct values across positive specimens at a single time point reflects the epidemic trajectory: a growing epidemic will necessarily have a high proportion of recently infected individuals with high viral loads, whereas a declining epidemic will have more individuals with older infections and thus lower viral loads. Because of these changing proportions, the epidemic trajectory or growth rate should be inferable from the distribution of Ct values collected in a single cross section, and multiple successive cross sections should enable identification of the longer-term incidence curve. Moreover, understanding the relationship between sample viral loads and epidemic dynamics provides additional insights into why viral loads from surveillance testing may appear higher for emerging viruses or variants and lower for outbreaks that are slowing, even absent changes in individual-level viral kinetics.

### RESULTS

Using a mathematical model for population-level viral load distributions calibrated to known features of SARS-CoV-2 viral load kinetics, we show that the median and skewness of Ct values in a random sample change over the course of an epidemic. By formalizing this relationship, we demonstrate that Ct values from a single random cross section of virologic testing can estimate the time-varying reproductive number of the virus in a population, which we validate using data collected from comprehensive SARS-CoV-2 testing in long-term care facilities. Using a more flexible approach to modeling infection incidence, we also develop a method that can reliably estimate the epidemic trajectory in even more complex populations, where interventions may be implemented and relaxed over time. This method performed well in estimating the epidemic trajectory in the state of Massachusetts using routine hospital admissions RT-qPCR testing data, accurately replicating estimates from other sources for the entire state.

### CONCLUSION

This work provides a new method for estimating the epidemic growth rate and a framework for robust epidemic monitoring using RT-qPCR Ct values that are often simply discarded. By deploying single or repeated (but small) random surveillance samples and making the best use of the semiquantitative testing data, we can estimate epidemic trajectories in real time and avoid biases arising from nonrandom samples or changes in testing practices over time. Understanding the relationship between population-level viral loads and the state of an epidemic reveals important implications and opportunities for interpreting virologic surveillance data. It also highlights the need for such surveillance, as these results show how to use it most informatively.

Figure: Ct values reflect the epidemic trajectory and can be used to estimate incidence. (A and B) Whether an epidemic has rising or falling incidence will be reflected in the distribution of times since infection (A), which in turn affects the distribution of Ct values in a surveillance sample (B). (C) These values can be used to assess whether the epidemic is rising or falling and to estimate the incidence curve.

Estimating an epidemic's trajectory is crucial for developing public health responses to infectious diseases, but case data used for such estimation are confounded by variable testing practices. We show that the population distribution of viral loads observed under random or symptom-based surveillance, in the form of cycle threshold (Ct) values obtained from reverse transcription quantitative polymerase chain reaction testing, changes during an epidemic. Thus, Ct values from even limited numbers of random samples can provide improved estimates of an epidemic's trajectory. Combining data from multiple such samples improves the precision and robustness of this estimation. We apply our methods to Ct values from surveillance conducted during the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic in a variety of settings and offer alternative approaches for real-time estimates of epidemic trajectories for outbreak management and response.
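The core intuition can be demonstrated with a toy simulation (our illustration, not the authors' model): sample times since infection under exponentially growing versus declining incidence, then map them to Ct values. The linear Ct kinetics used below (Ct near 20 at peak load around day 5, then rising roughly one Ct per day) are a simplifying assumption for illustration only.

```python
# Toy demonstration that the cross-sectional Ct distribution shifts with the
# epidemic trajectory: growth -> recent infections dominate -> low Ct values;
# decline -> older infections dominate -> high Ct values.
import numpy as np

rng = np.random.default_rng(1)

def sample_cts(growth_rate, n=5000, horizon=30):
    # Incidence tau days ago is proportional to exp(-growth_rate * tau).
    days_ago = np.arange(horizon)
    weights = np.exp(-growth_rate * days_ago)
    weights /= weights.sum()
    tau = rng.choice(days_ago, size=n, p=weights)      # days since infection
    # Assumed kinetics: Ct ~ 20 at peak (day 5), rising ~1 Ct/day, plus noise.
    ct = 20 + np.clip(tau - 5, 0, None) + rng.normal(0, 2, size=n)
    return ct

growing = sample_cts(+0.1)    # incidence growing ~10% per day
declining = sample_cts(-0.1)  # incidence declining ~10% per day
print("median Ct, growing epidemic:  %.1f" % np.median(growing))
print("median Ct, declining epidemic: %.1f" % np.median(declining))
```

A single random cross section thus carries a directional signal: a lower median Ct in the sample is consistent with a growing epidemic, a higher one with decline, matching the aggregated-Ct interpretation described above.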

The World of Reality, Causality and Real Artificial Intelligence: Exposing the Great Unknown Unknowns


"All men by nature desire to know." - Aristotle "He who does not know what the world is does not know where he is." - Marcus Aurelius "If I have seen further, it is by standing on the shoulders of giants." "The universe is a giant causal machine. The world is "at the bottom" governed by causal algorithms. Our bodies are causal machines. Our brains and minds are causal AI computers". The 3 biggest unknown unknowns are described and analyzed in terms of human intelligence and machine intelligence. A deep understanding of reality and its causality is to revolutionize the world, its science and technology, AI machines including. The content is the intro of Real AI Project Confidential Report: How to Engineer Man-Machine Superintelligence 2025: AI for Everything and Everyone (AI4EE). It is all a power set of {known, unknown; known unknown}, known knowns, known unknowns, unknown knowns, and unknown unknowns, like as the material universe's material parts: about 4.6% of baryonic matter, about 26.8% of dark matter, and about 68.3% of dark energy. There are a big number of sciences, all sorts and kinds, hard sciences and soft sciences. But what we are still missing is the science of all sciences, the Science of the World as a Whole, thus making it the biggest unknown unknowns. It is what man/AI does not know what it does not know, neither understand, nor aware of its scope and scale, sense and extent. "the universe consists of objects having various qualities and standing in various relationships" (Whitehead, Russell), "the world is the totality of states of affairs" (D. "World of physical objects and events, including, in particular, biological beings; World of mental objects and events; World of objective contents of thought" (K. How the world is still an unknown unknown one could see from the most popular lexical ontology, WordNet,see supplement. 
The construct of the world is typically missing its essential meaning, "the world as a whole", the world of reality, the ultimate totality of all worlds, universes, and realities, beings, things, and entities, the unified totalities. The world or reality or being or existence is "all that is, has been and will be". Of which the physical universe and cosmos is a key part, as "the totality of space and times and matter and energy, with all causative fundamental interactions".

Topic Modeling and Latent Dirichlet Allocation (LDA) using Gensim and Sklearn: Part 1 - Analytics Vidhya


Let's say you have a client who owns a publishing house. Your client comes to you with two tasks: first, he wants to categorize all the books or research papers he receives weekly under a common theme or topic, and second, he wants to condense large documents into smaller, bite-sized texts. Is there any technique or tool available that can do both of these tasks? We enter the world of Topic Modeling. I'll break this article into three parts.

The Physics of Energy-Based Models


The interactive version of this post can be found here. Since Medium does not support JavaScript or equations written in LaTeX, we recommend checking out our interactive post as well. When we think of Peter, his positive nature and his can-do attitude inevitably come to mind. He had an ability to inspire and carry along the people around him, convincing them that almost anything is possible. Back in 2018 he came up with the idea for an interactive blog post on the subject of energy-based models and pitched it to us. It was almost impossible to resist his enthusiastic manner, and of course we embarked on this adventure with him. Equipped with our knowledge of physics and machine learning, and hardly any clue about JavaScript, we started our journey. After a lot of hard work, we were finally able to complete this project. We are overjoyed with the final result, and we wish he could see it.

On fine-tuning of Autoencoders for Fuzzy rule classifiers

Recent advances in deep neural networks are allowing researchers to tackle some very complex problems, such as image classification and audio classification, with improved theoretical and empirical justification. This paper presents a novel scheme for incorporating autoencoders into fuzzy rule classifiers (FRCs). Autoencoders, when stacked, can learn the complex nonlinear relationships among data, and the proposed FRC framework allows users to supply expert knowledge to the system. The paper further introduces four novel fine-tuning strategies for autoencoders to improve the FRC's classification and rule-reduction performance. The proposed framework has been tested across five real-world benchmark datasets. Detailed comparisons with over 15 previous studies under 10-fold cross-validation suggest that the proposed methods can build FRCs that provide state-of-the-art accuracies.
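The general pipeline the abstract describes, compressing inputs with an autoencoder before rule-based classification, can be sketched as follows. This is our illustration, not the paper's code: a scikit-learn `MLPRegressor` trained to reconstruct its input plays the autoencoder, and a decision tree (whose learned splits are at least rule-like) stands in for the fuzzy rule classifier.

```python
# Autoencoder feature extraction feeding a downstream rule-style classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Autoencoder: reconstruct the 4 inputs through a 2-unit bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, random_state=0)
ae.fit(X_tr, X_tr)

def encode(X):
    # Hidden-layer activations (ReLU, MLPRegressor's default) = learned features.
    return np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

clf = DecisionTreeClassifier(random_state=0).fit(encode(X_tr), y_tr)
print("accuracy on encoded features:", round(clf.score(encode(X_te), y_te), 2))
```

The paper's fine-tuning strategies would then adjust the autoencoder weights with the downstream classifier's objective in mind, rather than freezing them after reconstruction training as done here.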

Would a robot trust you? Developmental robotics model of trust and theory of mind


The technological revolution taking place in the fields of robotics and artificial intelligence seems to indicate a future shift in our human-centred social paradigm towards a greater inclusion of artificial cognitive agents in our everyday environments. This means that collaborative scenarios between humans and robots will become more frequent and will have a deeper impact on everyday life. In this setting, research on trust in human–robot interaction (HRI) assumes major importance for ensuring the highest quality of the interaction itself, as trust directly affects people's willingness to accept information produced by a robot and to cooperate with it. Many studies have explored the trust humans place in robots and how it can be enhanced by tuning both the design and the behaviour of the machine, but much less research has focused on the opposite scenario: the trust that artificial agents can assign to people. Yet the latter is a critical factor in joint tasks where humans and robots depend on each other's effort to achieve a shared goal: just as a robot can fail, so can a person. For an artificial agent, knowing when to trust or distrust somebody, and adapting its plans to this prediction, can make all the difference between success and failure of the task. Our work is centred on the design and development of an artificial cognitive architecture for a humanoid autonomous robot that incorporates trust, theory of mind (ToM) and episodic memory, as we believe these are the three key factors for estimating the trustworthiness of others. We have tested our architecture on an established developmental psychology experiment [1], and the results we obtained confirm that our approach successfully models trust mechanisms and dynamics in cognitive robots.
Trust is a fundamental, unavoidable component of social interactions that can be defined as the willingness of a party (the trustor) to rely on the actions of another party (the trustee), with the former having no control over the latter [2].

Improving Label Quality by Jointly Modeling Items and Annotators

We propose a fully Bayesian framework for learning ground truth labels from noisy annotators. Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic Dawid and Skene joint annotator-data model. Earlier research along these lines has neither fully incorporated label distributions nor explored clustering by annotators only or by data only. Our framework incorporates all of these properties as: (1) a graphical model designed to provide better ground truth estimates of annotator responses as input to any black-box supervised learning algorithm, and (2) a standalone neural model whose internal structure captures many of the properties of the graphical model. We conduct supervised learning experiments using both models and compare their performance to one baseline and a state-of-the-art model.
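The Dawid and Skene model that this framework factors into can be sketched compactly as a point-estimate EM (a simplification of the paper's fully Bayesian treatment; the binary labels and per-annotator accuracies below are simulated for illustration): alternate between estimating each annotator's confusion matrix and the posterior over each item's true label.

```python
# EM for the classic Dawid-Skene annotator model on simulated binary labels.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_annotators, n_classes = 300, 5, 2
truth = rng.integers(n_classes, size=n_items)
accs = np.array([0.9, 0.85, 0.8, 0.6, 0.55])       # simulated annotator skill
labels = np.where(rng.random((n_items, n_annotators)) < accs,
                  truth[:, None], 1 - truth[:, None])

# Initialize posteriors from majority vote, then run EM.
post = np.zeros((n_items, n_classes))
post[np.arange(n_items), (labels.mean(1) > 0.5).astype(int)] = 1.0
for _ in range(20):
    # M-step: per-annotator confusion matrices conf[a, true, observed].
    conf = np.zeros((n_annotators, n_classes, n_classes))
    for a in range(n_annotators):
        for obs in range(n_classes):
            conf[a, :, obs] = post[labels[:, a] == obs].sum(0)
    conf += 0.01                                    # smoothing avoids log(0)
    conf /= conf.sum(-1, keepdims=True)
    prior = post.mean(0)
    # E-step: posterior over true labels given all annotator responses.
    logp = np.log(prior) + sum(np.log(conf[a, :, labels[:, a]])
                               for a in range(n_annotators))
    logp -= logp.max(1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(1, keepdims=True)

est = post.argmax(1)
print("EM label accuracy:", round((est == truth).mean(), 3))
```

The learned confusion matrices automatically downweight the unreliable annotators, which is the property the paper's Bayesian soft clustering extension builds on.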

The Principles of Deep Learning Theory

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
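The claim that wide random networks have nearly Gaussian outputs is easy to check numerically. The sketch below is our illustration, not the book's code: for a one-hidden-layer tanh network at a fixed unit-norm input, the hidden preactivations are i.i.d. standard normals, so we can sample outputs over random initializations directly and measure the excess kurtosis (zero for an exact Gaussian), which should shrink as width grows.

```python
# Excess kurtosis of a random one-hidden-layer tanh network's scalar output,
# at narrow vs. wide hidden layers, over many random initializations.
import numpy as np

rng = np.random.default_rng(0)

def output_samples(width, n_draws=10000):
    g = rng.normal(size=(n_draws, width))                 # hidden preactivations
    w2 = rng.normal(0, 1 / np.sqrt(width), size=(n_draws, width))
    return (w2 * np.tanh(g)).sum(axis=1)                  # scalar outputs

results = {}
for width in (2, 512):
    z = output_samples(width)
    excess = np.mean(z**4) / np.mean(z**2) ** 2 - 3       # 0 for a Gaussian
    results[width] = excess
    print(f"width {width:4d}: excess kurtosis {results[width]:+.3f}")
```

At width 2 the output distribution is visibly non-Gaussian, while at width 512 it is nearly Gaussian; the book's effective theory characterizes precisely how such deviations scale with the depth-to-width ratio.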

Leveraging Language to Learn Program Abstractions and Search Heuristics

Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains -- string editing, image composition, and abstract reasoning about scenes -- even when no natural language hints are available at test time.
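The two ingredients named above, a library of functions and a search strategy, can be illustrated with a toy enumerative synthesizer (our sketch, far simpler than LAPS or DreamCoder; the three-primitive string-editing DSL is invented): search shortest-first over compositions of library primitives for a program consistent with the input-output examples.

```python
# Brute-force enumerative program synthesis over a tiny string-editing DSL.
from itertools import product

PRIMITIVES = {
    "upper":   str.upper,
    "reverse": lambda s: s[::-1],
    "drop1":   lambda s: s[1:],
}

def run(program, s):
    # A program is a sequence of primitive names applied left to right.
    for op in program:
        s = PRIMITIVES[op](s)
    return s

def synthesize(examples, max_len=3):
    # Enumerate programs shortest-first; return the first one consistent
    # with every input-output example.
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

prog = synthesize([("hello", "OLLEH"), ("cab", "BAC")])
print(prog)
```

Because search cost grows exponentially with program length, a richer learned library (which makes target programs shorter) and a guided search model (which orders candidates better than blind enumeration) each attack the bottleneck this toy exposes, and natural language annotations are what LAPS uses to learn both.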