Collaborating Authors


Automating Data Science

Communications of the ACM

Data science covers the full spectrum of deriving insight from data, from initial data gathering and interpretation, via processing and engineering of data, and exploration and modeling, to eventually producing novel insights and decision support systems. Data science can be viewed as overlapping or broader in scope than other data-analytic methodological disciplines, such as statistics, machine learning, databases, or visualization.10 To illustrate the breadth of data science, consider, for example, the problem of recommending items (movies, books, or other products) to customers. While the core of these applications can consist of algorithmic techniques such as matrix factorization, a deployed system will involve a much wider range of technological and human considerations. These range from scalable back-end transaction systems that retrieve customer and product data in real time, experimental design for evaluating system changes, causal analysis for understanding the effect of interventions, to the human factors and psychology that underlie how customers react to visual information displays and make decisions. As another example, in areas such as astronomy, particle physics, and climate science, there is a rich tradition of building computational pipelines to support data-driven discovery and hypothesis testing. For instance, geoscientists use monthly global landcover maps based on satellite imagery at sub-kilometer resolutions to better understand how the Earth's surface is changing over time.50 These maps are interactive and browsable, and they are the result of a complex data-processing pipeline, in which terabytes to petabytes of raw sensor and image data are transformed into databases of a6utomatically detected and annotated objects and information. This type of pipeline involves many steps, in which human decisions and insight are critical, such as instrument calibration, removal of outliers, and classification of pixels. The breadth and complexity of these and many other data science scenarios means the modern data scientist requires broad knowledge and experience across a multitude of topics. Together with an increasing demand for data analysis skills, this has led to a shortage of trained data scientists with appropriate background and experience, and significant market competition for limited expertise. Considering this bottleneck, it is not surprising there is increasing interest in automating parts, if not all, of the data science process.

Simulation Intelligence: Towards a New Generation of Scientific Methods Artificial Intelligence

The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement. We present the "Nine Motifs of Simulation Intelligence", a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence operating system stack (SI-stack) and the motifs therein: (1) Multi-physics and multi-scale modeling; (2) Surrogate modeling and emulation; (3) Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based modeling; (6) Probabilistic programming; (7) Differentiable programming; (8) Open-ended optimization; (9) Machine programming. We believe coordinated efforts between motifs offers immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis-simulation-analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.

Learning from learning machines: a new generation of AI technology to meet the needs of science Artificial Intelligence

We outline emerging opportunities and challenges to enhance the utility of AI for scientific discovery. The distinct goals of AI for industry versus the goals of AI for science create tension between identifying patterns in data versus discovering patterns in the world from data. If we address the fundamental challenges associated with "bridging the gap" between domain-driven scientific models and data-driven AI learning machines, then we expect that these AI models can transform hypothesis generation, scientific discovery, and the scientific process itself.

Revisiting C.S.Peirce's Experiment: 150 Years Later Artificial Intelligence

An iconoclastic philosopher and polymath, Charles Sanders Peirce (1837-1914) is among the greatest of American minds. In 1872, Peirce conducted a series of experiments to determine the distribution of response times to an auditory stimulus, which is widely regarded as one of the most significant statistical investigations in the history of nineteenth-century American mathematical research (Stigler, 1978). On the 150th anniversary of this historic experiment, we look back at Peirce's view on empirical modeling through a modern statistical lens.

DRNets can solve Sudoku, speed scientific discovery


Say you're driving with a friend in a familiar neighborhood, and the friend asks you to turn at the next intersection. The friend doesn't say which way to turn, but since you both know it's a one-way street, it's understood. That type of reasoning is at the heart of a new artificial-intelligence framework – tested successfully on overlapping Sudoku puzzles – that could speed discovery in materials science, renewable energy technology and other areas. An interdisciplinary research team led by Carla Gomes, the Ronald C. and Antonia V. Nielsen Professor of Computing and Information Science in the Cornell Ann S. Bowers College of Computing and Information Science, has developed Deep Reasoning Networks (DRNets), which combine deep learning – even with a relatively small amount of data – with an understanding of the subject's boundaries and rules, known as "constraint reasoning." Di Chen, a computer science doctoral student in Gomes' group, is first author of "Automating Crystal-Structure Phase Mapping by Combining Deep Learning with Constraint Reasoning," published Sept. 16 in Nature Machine Intelligence.

Generalized Multivariate Signs for Nonparametric Hypothesis Testing in High Dimensions Machine Learning

High-dimensional data, where the dimension of the feature space is much larger than sample size, arise in a number of statistical applications. In this context, we construct the generalized multivariate sign transformation, defined as a vector divided by its norm. For different choices of the norm function, the resulting transformed vector adapts to certain geometrical features of the data distribution. Building up on this idea, we obtain one-sample and two-sample testing procedures for mean vectors of high-dimensional data using these generalized sign vectors. These tests are based on U-statistics using kernel inner products, do not require prohibitive assumptions, and are amenable to a fast randomization-based implementation. Through experiments in a number of data settings, we show that tests using generalized signs display higher power than existing tests, while maintaining nominal type-I error rates. Finally, we provide example applications on the MNIST and Minnesota Twin Studies genomic data.

Toward Building Science Discovery Machines Artificial Intelligence

The dream of building machines that can do science has inspired scientists for decades. Remarkable advances have been made recently; however, we are still far from achieving this goal. In this paper, we focus on the scientific discovery process where a high level of reasoning and remarkable problem-solving ability are required. We review different machine learning techniques used in scientific discovery with their limitations. We survey and discuss the main principles driving the scientific discovery process. These principles are used in different fields and by different scientists to solve problems and discover new knowledge. We provide many examples of the use of these principles in different fields such as physics, mathematics, and biology. We also review AI systems that attempt to implement some of these principles. We argue that building science discovery machines should be guided by these principles as an alternative to the dominant approach of current AI systems that focuses on narrow objectives. Building machines that fully incorporate these principles in an automated way might open the doors for many advancements.

Approximation Algorithms for Active Sequential Hypothesis Testing Machine Learning

In the problem of active sequential hypotheses testing (ASHT), a learner seeks to identify the true hypothesis $h^*$ from among a set of hypotheses $H$. The learner is given a set of actions and knows the outcome distribution of any action under any true hypothesis. While repeatedly playing the entire set of actions suffices to identify $h^*$, a cost is incurred with each action. Thus, given a target error $\delta>0$, the goal is to find the minimal cost policy for sequentially selecting actions that identify $h^*$ with probability at least $1 - \delta$. This paper provides the first approximation algorithms for ASHT, under two types of adaptivity. First, a policy is partially adaptive if it fixes a sequence of actions in advance and adaptively decides when to terminate and what hypothesis to return. Under partial adaptivity, we provide an $O\big(s^{-1}(1+\log_{1/\delta}|H|)\log (s^{-1}|H| \log |H|)\big)$-approximation algorithm, where $s$ is a natural separation parameter between the hypotheses. Second, a policy is fully adaptive if action selection is allowed to depend on previous outcomes. Under full adaptivity, we provide an $O(s^{-1}\log (|H|/\delta)\log |H|)$-approximation algorithm. We numerically investigate the performance of our algorithms using both synthetic and real-world data, showing that our algorithms outperform a previously proposed heuristic policy.

Autonomous synthesis of metastable materials Artificial Intelligence

Autonomous experimentation enabled by artificial intelligence (AI) offers a new paradigm for accelerating scientific discovery. Non-equilibrium materials synthesis is emblematic of complex, resource-intensive experimentation whose acceleration would be a watershed for materials discovery and development. The mapping of non-equilibrium synthesis phase diagrams has recently been accelerated via high throughput experimentation but still limits materials research because the parameter space is too vast to be exhaustively explored. We demonstrate accelerated synthesis and exploration of metastable materials through hierarchical autonomous experimentation governed by the Scientific Autonomous Reasoning Agent (SARA). SARA integrates robotic materials synthesis and characterization along with a hierarchy of AI methods that efficiently reveal the structure of processing phase diagrams. SARA designs lateral gradient laser spike annealing (lg-LSA) experiments for parallel materials synthesis and employs optical spectroscopy to rapidly identify phase transitions. Efficient exploration of the multi-dimensional parameter space is achieved with nested active learning (AL) cycles built upon advanced machine learning models that incorporate the underlying physics of the experiments as well as end-to-end uncertainty quantification. With this, and the coordination of AL at multiple scales, SARA embodies AI harnessing of complex scientific tasks. We demonstrate its performance by autonomously mapping synthesis phase boundaries for the Bi$_2$O$_3$ system, leading to orders-of-magnitude acceleration in establishment of a synthesis phase diagram that includes conditions for kinetically stabilizing $\delta$-Bi$_2$O$_3$ at room temperature, a critical development for electrochemical technologies such as solid oxide fuel cells.