Scientific Discovery
Listen To Whistler Waves NASA Recorded From Space
Researchers have made a breakthrough discovery about the impulsive electron loss that happens in the Earth's upper atmosphere. A paper on the research was published in Geophysical Research Letters on Wednesday and details the scientific discoveries two spacecraft made about the loss and its cause, according to NASA. The CubeSat FIREBIRD II was one of the craft that recorded the electron microburst when it happened. It observed the microbursts from its orbit 310 miles above Earth, while one of the Van Allen Probes, which orbits a bit higher up, captured a rising-tone lower-band chorus. That chorus of waves had a duration and cadence highly similar to those of the microburst that FIREBIRD II had captured.
Kernel Two-Sample Hypothesis Testing Using Kernel Set Classification
The two-sample hypothesis testing problem is studied for the challenging scenario of high dimensional data sets with small sample sizes. We show that the two-sample hypothesis testing problem can be posed as a one-class set classification problem. In the set classification problem the goal is to classify a set of data points that are assumed to have a common class. We prove that the average probability of error given a set is less than or equal to the Bayes error and decreases as a power of $n$, the number of sample data points in the set. We use the positive definite Set Kernel for directly mapping sets of data to an associated Reproducing Kernel Hilbert Space, without the need to learn a probability distribution. We specifically solve the two-sample hypothesis testing problem using a one-class SVM in conjunction with the proposed Set Kernel. We compare the proposed method with the Maximum Mean Discrepancy, F-Test and T-Test methods on a number of challenging simulated high dimensional, small sample size data sets. We also perform two-sample hypothesis testing experiments on six cancer gene expression data sets and achieve zero type-I and type-II error results on all data sets.
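To make the idea of a kernel between sets concrete, here is a minimal Python sketch. It assumes a set kernel defined as the average pairwise Gaussian RBF kernel between two sets, which may differ from the Set Kernel proposed in the paper, and it uses that quantity to form the standard biased estimate of the squared Maximum Mean Discrepancy, one of the baselines named in the abstract. The function names, the `gamma` value, and the toy data are illustrative assumptions.

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def set_kernel(A, B, gamma=0.5):
    """Average pairwise RBF kernel between two sets of vectors.

    Assumption: this mean-embedding form stands in for the paper's
    Set Kernel, whose exact definition is not given in the abstract.
    """
    total = sum(rbf(a, b, gamma) for a in A for b in B)
    return total / (len(A) * len(B))

def mmd_squared(A, B, gamma=0.5):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (set_kernel(A, A, gamma)
            + set_kernel(B, B, gamma)
            - 2.0 * set_kernel(A, B, gamma))

# Toy two-dimensional samples (hypothetical data).
sample_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
sample_b = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

print("MMD^2, identical sets:", mmd_squared(sample_a, sample_a))
print("MMD^2, shifted sets:  ", mmd_squared(sample_a, sample_b))
```

A full test would still need a rejection threshold, for example from a permutation procedure; the sketch only computes the statistic.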
Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data
Karpatne, Anuj, Atluri, Gowtham, Faghmous, James, Steinbach, Michael, Banerjee, Arindam, Ganguly, Auroop, Shekhar, Shashi, Samatova, Nagiza, Kumar, Vipin
Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science.
Trend Analysis of Fragmented Time Series: Hypothesis Testing Based Adaptive Spline Filtering Method
Missing data present significant challenges to trend analysis of time series. Straightforward approaches that fill in missing data with constant or zero values or with linear trends can severely degrade the quality and reliability of the trend analysis. We present a robust adaptive approach to discover the trends from fragmented time series. The approach proposed in this paper is based on the HASF (Hypothesis-testing-based Adaptive Spline Filtering) trend analysis algorithm, which can accommodate non-uniform sampling and is therefore inherently robust to missing data. HASF adapts the nodes of the spline based on hypothesis testing and variance minimization, which adds to its robustness.
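The effect of zero-filling can be seen in a small Python sketch. It is not an implementation of HASF: it fits an ordinary least-squares line, once to a zero-filled series and once to the observed, non-uniformly spaced samples only. The series, the set of missing indices, and the function name are hypothetical.

```python
def ls_slope(t, y):
    """Ordinary least-squares slope of y against t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    num = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return num / den

# A noiseless linear trend with true slope 2.0 (hypothetical data).
times = list(range(10))
values = [2.0 * t + 1.0 for t in times]
missing = {3, 4, 5, 8}  # indices treated as gaps

# Naive approach: replace the gaps with zeros and fit on the full grid.
zero_filled = [0.0 if t in missing else v for t, v in zip(times, values)]
slope_zero_filled = ls_slope(times, zero_filled)

# Alternative: fit only the observed samples at their own time stamps.
obs_t = [t for t in times if t not in missing]
obs_y = [v for t, v in zip(times, values) if t not in missing]
slope_observed = ls_slope(obs_t, obs_y)

print("slope with zero-filled gaps:", slope_zero_filled)
print("slope from observed samples:", slope_observed)
```

Fitting at the observed time stamps recovers the true slope here, while the zero-filled fit is pulled away from it.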
Data Science- Hypothesis Testing Using Minitab and R
Formulating the null and the alternate hypothesis for a normality test; choosing the null hypothesis to correspond to taking no action, and the alternate hypothesis to taking action; checking for normality in Minitab; interpreting the Q-Q plot; comparing the computed p-value with α (alpha) to decide whether or not to take the action; steps for performing the 1-sample Z test; selecting the appropriate hypothesis in Minitab.
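The decision rule of comparing a p-value with α can be sketched for the 1-sample Z test in Python rather than Minitab or R. The sample values, the hypothesized mean `mu0`, and the known standard deviation `sigma` are hypothetical, and the test shown is the two-sided version.

```python
import math

def one_sample_z_test(sample, mu0, sigma):
    """Two-sided one-sample Z test with known population sigma.

    Returns the z statistic and the two-sided p-value.
    """
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical measurements; H0: mean == 50, H1: mean != 50.
sample = [52.0, 51.0, 53.0, 52.0]
alpha = 0.05
z, p_value = one_sample_z_test(sample, mu0=50.0, sigma=2.0)

# Reject H0 (take the action) only when p < alpha.
reject_null = p_value < alpha
print("z =", z, "p =", p_value, "reject H0:", reject_null)
```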
Google's AI is a "new paradigm" that unites humans and machines
Google is fully aware of artificial intelligence's (AI) potential -- DeepMind's AlphaGo AI is one of today's most well-known examples of its capabilities -- and in an earnings call this week, the company made it clear they believe the future of technology lies with AI. During the call, Sundar Pichai, CEO of Alphabet (Google's parent company), praised the company's decision to invest in AI early, highlighting the concept's trajectory from "a research project to something that can solve new problems for a billion people a day," according to an Inverse report. Pichai went on to note how Google's AI research is already producing products that utilize machine learning, such as the Google Clips camera that debuted earlier this month. "Even though we are in the early days of AI, we are already rethinking how to build products around machine learning," said Pichai. "It's a new paradigm compared to mobile-first software, and I'm thrilled how Google is leading the way."
From Distance Correlation to Multiscale Generalized Correlation
Shen, Cencheng, Priebe, Carey E., Vogelstein, Joshua T.
Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. We proposed the Multiscale Generalized Correlation (MGC) in Shen et al. 2017 as a novel correlation measure, which worked well empirically and helped a number of real data discoveries. But there is a wide gap on the theoretical side, e.g., the population statistic, the convergence from sample to population, how well the algorithmic Sample MGC performs, etc. To better understand its underlying mechanism, in this paper we formalize the population version of local distance correlations, MGC, and the optimal local scale between the underlying random variables, by utilizing the characteristic functions and incorporating the nearest-neighbor machinery. The population version enables a seamless connection with, and significant improvement to, the algorithmic Sample MGC, both theoretically and in practice, which further allows a number of desirable asymptotic and finite-sample properties to be proved and explored for MGC. The advantages of MGC are further illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power against monotone dependencies while achieving superior performance against general dependencies.
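As background for the title's starting point, the following Python sketch computes the ordinary sample distance correlation for one-dimensional data using double-centered distance matrices. It is not MGC and does not include the local-scale or nearest-neighbor steps described in the abstract. The function names and data are illustrative assumptions.

```python
import math

def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D sequences."""
    n = len(x)
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a * a for row in A for a in row) / n ** 2
    dvar_y = sum(b * b for row in B for b in row) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # a constant sequence carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)

def pearson(x, y):
    """Pearson correlation, for contrast with distance correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2.0 * v + 1.0 for v in x]   # monotone dependence
parabola = [v * v for v in x]         # non-monotone dependence

print("linear:   dcor =", distance_correlation(x, linear))
print("parabola: dcor =", distance_correlation(x, parabola),
      "pearson =", pearson(x, parabola))
```

On the symmetric parabola the Pearson correlation vanishes while the distance correlation stays clearly positive, which is the kind of general dependence the abstract refers to.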
Uncommon Hypothesis Tests to Debunk Common Misconceptions
I gave a talk about p-values and hypothesis testing at BIDS. Please check out my slides! P-values get a large share of the blame for the replication crisis in science. People take for granted that the tests they use work without justifying the leap from data to model. Often, reported p-values are erroneous because the underlying model doesn't accurately describe the way the data arose.
Open data from the Large Hadron Collider sparks new discovery
Back in 2014, CERN released the data from its Large Hadron Collider (LHC) experiments onto an online portal called the Open Data portal. It was an unprecedented move, making data from the LHC's experiments available to those who don't have access to a particle accelerator. It's not completely up-to-date; there's a three-year embargo on results, so, generally speaking, the most recent data being uploaded is from the year 2014. This was the first time results of any particle collider experiment have been released to the public, and now it's produced results. Last week, a team from MIT released an article in Physical Review Letters that used data from the Compact Muon Solenoid (CMS), one of the LHC's main detectors, to explain a feature within high-energy particle collisions.
The Fourth Paradigm: Data-Intensive Scientific Discovery - Microsoft Research
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies. In The Fourth Paradigm: Data-Intensive Scientific Discovery, the collection of essays expands on the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science and offers insights into how it can be fully realized. "The individual essays--and The Fourth Paradigm as a whole--give readers a glimpse of the horizon for 21st-century research and, at their best, a peek at what lies beyond." "The impact of Jim Gray's thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science." "I often tell people working in eScience that they aren't in this field because they are visionaries or super-intelligent--it's because they care about science and they are alive now.