Scientific Discovery


A Simple Way to Deal with Cherry-picking

arXiv.org Machine Learning

Statistical hypothesis testing serves as statistical evidence for scientific innovation. However, if the reported results are intentionally biased, hypothesis testing no longer controls the rate of false discovery. In particular, we study such selection bias in machine learning models where the reporter is motivated to promote an algorithmic innovation. When the number of possible configurations (e.g., datasets) is large, we show that the reporter can falsely report an innovation even if there is no improvement at all. We propose a 'post-reporting' solution to this issue where the bias of the reported results is verified by another set of results. The theoretical findings are supported by experimental results with synthetic and real-world datasets.
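The effect described in this abstract can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the paper's method: the configurations are assumed independent, there is no true improvement, and each p-value is therefore uniform on [0, 1]. The function names and parameters (`num_configs`, `alpha`, `trials`) are hypothetical.

```python
import random


def chance_of_false_report(num_configs, alpha=0.05):
    """Probability that at least one of num_configs independent null tests
    has p < alpha (assumption: tests are independent, no true improvement)."""
    return 1.0 - (1.0 - alpha) ** num_configs


def simulate_cherry_picking(num_configs, alpha=0.05, trials=2000, seed=0):
    """Fraction of trials in which a reporter who shows only the best
    configuration can claim significance although nothing improved."""
    rng = random.Random(seed)
    false_reports = 0
    for _ in range(trials):
        # Under the null hypothesis each p-value is uniform on [0, 1];
        # the cherry-picking reporter keeps only the smallest one.
        best_p = min(rng.random() for _ in range(num_configs))
        if best_p < alpha:
            false_reports += 1
    return false_reports / trials


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(k, round(chance_of_false_report(k), 3), simulate_cherry_picking(k))
```

With a single configuration the false-report probability equals `alpha`, but under these assumptions it grows quickly as the reporter gains more configurations to choose from.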


The Rise of Dataism: A Threat to Freedom or a Scientific Revolution?

#artificialintelligence

What would happen if we made all of our data public--everything from wearables monitoring our biometrics, all the way to smartphones monitoring our location, our social media activity, and even our internet search history? Would such insights into our lives simply provide companies and politicians with greater power to invade our privacy and manipulate us by using our psychological profiles against us? A burgeoning new philosophy called dataism doesn't think so. In fact, this trending ideology holds that liberating the flow of data is the supreme value of the universe, and that it could be the key to unleashing the greatest scientific revolution in the history of humanity. First mentioned by David Brooks in his 2013 New York Times article "The Philosophy of Data," dataism is an ethical system that has been most heavily explored and popularized by the renowned historian Yuval Noah Harari.


Poker-Faced Trading: Will This Theory Change Your Strategy?

#artificialintelligence

Sure, you've heard of game theory. And sure, trading is like a game: you devise a strategy, learn the rules, and try to beat everyone else to the punch, finding trends before anyone else does. You've even heard that letting your emotions get the best of you is a terrible trading strategy. To start to undermine our own self-destructive habits, it helps to understand how the game is geared toward those who figure out the game behind the game: how to play to win by controlling the emotions that lead us to bad decisions. Game theory can be applied to human trading because the object of a trade is to "win" a profit.


The Fourth Copernican Revolution - Issue 64: The Unseen

Nautilus

The sun formed 4.5 billion years ago, but it's got around 6 billion years more before its fuel runs out. It will then flare up, engulfing the inner planets. And the expanding universe will continue--perhaps forever--destined to become ever colder, ever emptier. To quote Woody Allen, eternity is very long, especially toward the end. Any creatures witnessing the sun's demise won't be human--they'll be as different from us as we are from a bug.


Multi-level hypothesis testing for populations of heterogeneous networks

arXiv.org Machine Learning

In this work, we consider hypothesis testing and anomaly detection on datasets where each observation is a weighted network. Examples of such data include brain connectivity networks from fMRI flow data, or word co-occurrence counts for populations of individuals. Current approaches to hypothesis testing for weighted networks typically require thresholding the edge weights to transform the data into binary networks. This results in a loss of information, and outcomes are sensitive to the choice of threshold levels. Our work avoids this, and we consider weighted-graph observations in two situations: 1) where each graph belongs to one of two populations, and 2) where entities belong to one of two populations, with each entity possessing multiple graphs (indexed, e.g., by time). Specifically, we propose a hierarchical Bayesian hypothesis testing framework that models each population with a mixture of latent space models for weighted networks, and then tests populations of networks for differences in distribution over components. Our framework is capable of population-level, entity-specific, as well as edge-specific hypothesis testing. We apply it to synthetic data and three real-world datasets: two social media datasets involving word co-occurrences from discussions on Twitter of the political unrest in Brazil, and on Instagram concerning Attention Deficit Hyperactivity Disorder (ADHD) and depression drugs, and one medical dataset involving fMRI brain scans of human subjects. The results show that our proposed method has lower Type I error and higher statistical power compared to alternatives that need to threshold the edge weights. Moreover, they show our proposed method is better suited to deal with highly heterogeneous datasets.


Thomas Kuhn Threw an Ashtray at Me - Issue 63: Horizons

Nautilus

Errol Morris feels that Thomas Kuhn saved him from a career he was not suited for--by having him thrown out of Princeton. In 1972, Kuhn was a professor of philosophy and the history of science at Princeton, and author of The Structure of Scientific Revolutions, which gave the world the term "paradigm shift." As Morris tells the story in his recent book, The Ashtray, Kuhn was antagonized by Morris' suggestions that Kuhn was a megalomaniac and The Structure of Scientific Revolutions was an assault on truth and progress. To say the least, Morris, then 24, was already the iconoclast who would go on to make some of the most original documentary films of our time. After launching the career he was suited for with The Gates of Heaven in 1978, a droll affair about pet cemeteries, Morris earned international acclaim with The Thin Blue Line, which led to the reversal of a murder conviction of a prisoner who had been on death row. In 2004, Morris won an Academy Award for The Fog of War, a dissection of former Secretary of Defense Robert McNamara, a major architect of the Vietnam War. His 2017 film, Wormwood, a miniseries on Netflix, centers on the mystery surrounding a scientist who in 1975 worked on a biological warfare program for the Army, and suspiciously fell to his death from a hotel room. The Ashtray--Morris explains the title in our interview below--is as arresting and idiosyncratic as Morris' films.


AI for code encourages collaborative, open scientific discovery

#artificialintelligence

We have seen significant recent progress in pattern analysis and machine intelligence applied to images, audio and video signals, and natural language text, but not as much applied to another artifact produced by people: computer program source code. In a paper to be presented at the FEED Workshop at KDD 2018, we showcase a system that makes progress towards the semantic analysis of code. By doing so, we provide the foundation for machines to truly reason about program code and learn from it. The work, also recently demonstrated at IJCAI 2018, is conceived and led by IBM Science for Social Good fellow Evan Patterson and focuses specifically on data science software. Data science programs are a special kind of computer code, often fairly short, but full of semantically rich content that specifies a sequence of data transformation, analysis, modeling, and interpretation operations.


Data Discovery Evolving Into Information Relationship Mapping Leveraging Machine Learning

Forbes - Tech

What once started as early analysis of singular data sources has now evolved into far more robust ways of analyzing information and the relationships between different fields and information sources. Data discovery is another area where machine learning (ML) is beginning to make inroads. Twenty years ago, data discovery was a term used to define the early analytics needed to better understand data. For instance, Evoke Software was a company that analyzed large volumes of customer data. It both used metadata to understand field content to find trends and exceptions, and also looked at raw data and used algorithms to identify field boundaries in older or less documented data sources.


A look at the leading data discovery software and vendors

#artificialintelligence

Turning data into business insight is the ultimate goal. It's not about gathering as much data as possible, it's about applying tools and making discoveries that help a business succeed. The data discovery software market includes a range of software and cloud-based services that can help organizations gain value from their constantly growing information resources. These products fall within the broad BI category, and at their most basic, they search for patterns within data and data sets. Many of these tools use visual presentation mechanisms, such as maps and models, to highlight patterns or specific items of relevance.


Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive Labels

arXiv.org Artificial Intelligence

One important assumption underlying common classification models is the stationarity of the data. However, in real-world streaming applications, the data concept indicated by the joint distribution of feature and label is not stationary but drifts over time. Concept drift detection aims to detect such drifts and adapt the model so as to mitigate any deterioration in the model's predictive performance. Unfortunately, most existing concept drift detection methods rely on a strong and over-optimistic condition that the true labels are available immediately for all already classified instances. In this paper, a novel Hierarchical Hypothesis Testing framework with a Request-and-Reverify strategy is developed to detect concept drifts by requesting labels only when necessary. Two methods, namely Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise "Goodness-of-fit" (HHT-AG), are proposed under this framework. In experiments with benchmark datasets, our methods demonstrate overwhelming advantages over state-of-the-art unsupervised drift detectors. More importantly, our methods even outperform DDM (the widely used supervised drift detector) when we use significantly fewer labels.
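The request-and-reverify idea in this abstract can be sketched roughly as follows. This is a simplified illustration and not the paper's HHT-CU or HHT-AG procedure: the first layer here is an assumed z-test on mean classifier uncertainty, the second layer is an assumed error-rate comparison, and every name and threshold (`z_threshold`, `tolerance`, `request_labels`) is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def layer1_suspects_drift(reference_uncertainty, recent_uncertainty, z_threshold=3.0):
    """Label-free check: has the mean classifier uncertainty shifted?"""
    ref_mean = mean(reference_uncertainty)
    ref_std = stdev(reference_uncertainty)
    recent_mean = mean(recent_uncertainty)
    if ref_std == 0:
        return recent_mean != ref_mean
    standard_error = ref_std / sqrt(len(recent_uncertainty))
    return abs(recent_mean - ref_mean) / standard_error > z_threshold


def layer2_confirms_drift(reference_error_rate, recent_errors, tolerance=0.1):
    """Labelled check: did the error rate rise by more than tolerance?
    recent_errors holds 1 for a misclassified instance and 0 otherwise."""
    return mean(recent_errors) - reference_error_rate > tolerance


def detect_drift(reference_uncertainty, recent_uncertainty,
                 reference_error_rate, request_labels):
    """Return (drift_confirmed, labels_used). Labels are requested through
    the request_labels callback only when layer 1 raises a suspicion."""
    if not layer1_suspects_drift(reference_uncertainty, recent_uncertainty):
        return False, 0
    recent_errors = request_labels()
    confirmed = layer2_confirms_drift(reference_error_rate, recent_errors)
    return confirmed, len(recent_errors)
```

The point of the two-layer design is that the cheap, label-free test runs continuously, while the expensive labelled test runs only to confirm or reject a suspected drift.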