
scientific discovery

Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery


This paper revisits datasets and evaluation criteria for Symbolic Regression, the task of expressing given data using mathematical equations, with a specific focus on its potential for scientific discovery. Focusing on a set of formulas used in the existing datasets based on the Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design a reasonably realistic sampling range of values, so that our new SRSD datasets can be used to evaluate the potential of SRSD, such as whether or not an SR method can (re)discover physical laws from such datasets. As an evaluation metric, we also propose to use normalized edit distances between the predicted and ground-truth equation trees. While existing metrics are either binary or measure errors between the target values and an SR model's predicted values for a given input, normalized edit distances evaluate a sort of similarity between the ground-truth and predicted equation trees.
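To make the idea of a normalized edit distance concrete, here is a minimal sketch. It is not the paper's implementation: as a simplifying assumption it flattens each equation tree into a prefix-notation token list and uses plain Levenshtein distance on those lists as a stand-in for a true tree edit distance. Dividing by the size of the ground-truth expression and capping the result at 1 are also assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, truth):
    """Edit distance divided by the ground-truth length, capped at 1.

    Assumption: normalizing by len(truth) and capping at 1 is one
    plausible choice, not necessarily the paper's exact definition.
    """
    if not truth:
        return 0.0 if not pred else 1.0
    return min(1.0, edit_distance(pred, truth) / len(truth))


# Ground truth F = m * a, written in prefix notation.
truth = ["mul", "m", "a"]
# A prediction that uses the wrong second operand.
pred = ["mul", "m", "g"]
score = normalized_edit_distance(pred, truth)
```

Unlike a binary "exact match" metric, the score here moves gradually from 0 (identical expressions) toward 1 (nothing in common), which is the property the abstract attributes to normalized edit distances.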

20-24/06/2022 - AI4SD Machine Learning Summer School : AI 4 Scientific Discovery


We are pleased to announce that this summer AI4SD will be running a hybrid residential summer school from the 20th-24th June 2022 at the University of Southampton. This summer school will introduce you to basic Python programming and to different areas of machine learning, including mathematical foundations for ML, classification and clustering, kernel methods, an introduction to deep learning, and case studies in chemistry, including reinforcement learning in chemistry. There will also be talks to upskill scientists in other relevant areas, including Group Management, Presentation Skills, Research Data Management, Referencing, LaTeX, GitHub and Ethics. The summer school will include a hackathon where students can compete in teams to solve the same problem in the best way. Group presentations will take place on the Friday and prizes will be given to the winning team.

How Risk Aversion Is Killing the Spirit of Scientific Discovery

Mother Jones

The Allen Telescope Array, used by Northern California's SETI Institute in its often difficult-to-fund search for extraterrestrial life. Redding Record Searchlight / Zuma Press. This story was originally published by Undark and is reproduced here as part of the Climate Desk collaboration. Science is built on the boldly curious exploration of the natural world. Astounding leaps of imagination and insight--coupled with a laser-like focus on empiricism and experimentation--have brought forth countless wonders of insight into the workings of the universe we find ourselves in. But the culture that celebrates, supports, and rewards the audacious mental daring that is the hallmark of science is at risk of collapsing under a mountain of cautious, risk-averse, incurious advancement that seeks merely to win grants and peer approval. I've encountered this problem myself.

Data Discovery for ML Engineers


Real-world production ML systems consist of two main components: data and code. Data is clearly the leader, and rapidly taking center stage. Data defines the quality of almost any ML-based product, more so than code or any other aspect. In Feature Store as a Foundation for Machine Learning, we have discussed how feature stores are an integral part of the machine learning workflow. They improve the ROI of data engineering, reduce cost per model, and accelerate model-to-market by simplifying feature definition and extraction.

Understanding Type-I and Type-II errors in hypothesis testing


We can all relate to wondering whether route A will take less time than route B, whether the average return on investment X is higher than on investment Y, or whether movie ABC is better than movie XYZ. In all these cases, we are testing some hypotheses we have in our minds. Setting up hypotheses, proving or disproving them using data, and helping businesses make decisions is bread and butter for Data Scientists. Data Scientists often rely on probabilities to understand the likelihood of observing data by chance and use that to draw conclusions about a hypothesis. Hence, there is always a chance of making errors when drawing conclusions about our assumed hypothesis. The post below is written to provide an intuitive yet detailed explanation of the Type-I and Type-II errors that happen during statistical hypothesis testing.

Hypothesis Testing


In statistics, hypothesis testing is a form of inference that uses data to draw conclusions about a population. First, we make an assumption about the population, which is known as the Null Hypothesis. It is denoted by H₀. Then we define the Alternate Hypothesis, which is the opposite of what is stated in the Null Hypothesis, denoted by Hₐ. After defining both the Null Hypothesis and the Alternate Hypothesis, we perform what is known as a hypothesis test to either reject or fail to reject the Null Hypothesis.
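As a small worked illustration of the two error types, the sketch below computes them for a one-sided z-test of H₀: μ = μ₀ against Hₐ: μ > μ₀ with known standard deviation. The choice of test, the function name, and the example numbers are assumptions made for illustration and do not come from the article. A Type-I error is rejecting H₀ when it is true; a Type-II error is failing to reject H₀ when a specific alternative mean is true.

```python
from math import sqrt
from statistics import NormalDist


def z_test_error_rates(mu0, mu_alt, sigma, n, alpha=0.05):
    """Return (type_i, type_ii) for a one-sided z-test of
    H0: mu = mu0 versus Ha: mu > mu0 with known sigma."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Reject H0 when the sample mean exceeds this critical value.
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Type I: probability of rejecting H0 although mu really is mu0.
    type_i = 1 - NormalDist(mu0, se).cdf(critical)
    # Type II: probability of not rejecting H0 although mu is mu_alt.
    type_ii = NormalDist(mu_alt, se).cdf(critical)
    return type_i, type_ii


# Hypothetical numbers: H0 mean 0, true mean 0.5, sigma 1.
small_sample = z_test_error_rates(mu0=0.0, mu_alt=0.5, sigma=1.0, n=25)
large_sample = z_test_error_rates(mu0=0.0, mu_alt=0.5, sigma=1.0, n=100)
```

With the significance level held fixed, the Type-I rate stays at alpha by construction, while the Type-II rate shrinks as the sample size grows. Comparing `small_sample` with `large_sample` shows this trade-off.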

Automating Data Science

Communications of the ACM

Data science covers the full spectrum of deriving insight from data, from initial data gathering and interpretation, via processing and engineering of data, and exploration and modeling, to eventually producing novel insights and decision support systems. Data science can be viewed as overlapping or broader in scope than other data-analytic methodological disciplines, such as statistics, machine learning, databases, or visualization.10 To illustrate the breadth of data science, consider, for example, the problem of recommending items (movies, books, or other products) to customers. While the core of these applications can consist of algorithmic techniques such as matrix factorization, a deployed system will involve a much wider range of technological and human considerations. These range from scalable back-end transaction systems that retrieve customer and product data in real time, experimental design for evaluating system changes, causal analysis for understanding the effect of interventions, to the human factors and psychology that underlie how customers react to visual information displays and make decisions. As another example, in areas such as astronomy, particle physics, and climate science, there is a rich tradition of building computational pipelines to support data-driven discovery and hypothesis testing. For instance, geoscientists use monthly global landcover maps based on satellite imagery at sub-kilometer resolutions to better understand how the Earth's surface is changing over time.50 These maps are interactive and browsable, and they are the result of a complex data-processing pipeline, in which terabytes to petabytes of raw sensor and image data are transformed into databases of automatically detected and annotated objects and information. This type of pipeline involves many steps, in which human decisions and insight are critical, such as instrument calibration, removal of outliers, and classification of pixels.
The breadth and complexity of these and many other data science scenarios mean the modern data scientist requires broad knowledge and experience across a multitude of topics. Together with an increasing demand for data analysis skills, this has led to a shortage of trained data scientists with appropriate background and experience, and significant market competition for limited expertise. Considering this bottleneck, it is not surprising there is increasing interest in automating parts, if not all, of the data science process.

Science and innovation rely on successful collaboration


It may sound obvious, perhaps even clichéd, but this mantra is something that must be remembered in ongoing political negotiations over Horizon Europe, which could see Switzerland and the UK excluded from EU research projects. We need more, not fewer, researchers collaborating to solve today's and tomorrow's challenges. By closely working with Swiss and British researchers, who have long played key roles, Horizon Europe projects will benefit – as they have in the past. This is the motivation behind ETH Zurich, which collaborates with IBM Research on nanotechnology, leading the Stick to Science campaign. This calls on all three parties – Switzerland, the UK and the EU – to try and solve the current stalemate and put Swiss and British association agreements in place.

How AI is Changing Chemical Discovery


While engineering, finance, and commerce have profited immensely from novel algorithms, they are not the only fields to do so. Large-scale computation has been an integral part of the toolkit in the physical sciences for many decades, and some of the recent advances in AI have started to change how scientific discoveries are made. There has been a lot of excitement about prominent achievements in the physical sciences, like using machine learning to render an image of a black hole or the contribution of AlphaFold towards protein folding. This article will cover some of the more prominent uses of AI in chemistry, the parent discipline of the aforementioned protein folding problem. One of the chief goals of chemistry is to understand matter, its properties, and the transformations it can undergo.

Artificial Intelligence Trends to Look Forward To In 2022


These jaw-dropping developments raised expectations of AI and made many curious about upcoming trends and advances in the field. Thus, this article will highlight some of the key forthcoming developments in AI, poised to make it more potent and impactful. Language modeling is the machine understanding and generation of natural languages, which is used in applications such as speech recognition, machine translation, handwriting recognition, question answering and information retrieval. Since OpenAI released GPT-3, the most powerful language model ever built, it has been in the limelight due to its breathtaking language capabilities. For example, it has been demonstrated that--with proper human priming--GPT-3 can generate creative fiction and working computer code, and compose introspective business memos.