The global AI agenda: Promise, reality, and a future of data sharing


"The global AI agenda: Promise, reality, and a future of data sharing" is an MIT Technology Review Insights report produced in partnership with Genesys and Philips. It was developed through a global survey, conducted in January and February 2020, of over 1,000 executives across 11 different sectors, and a series of interviews with experts who have specific responsibility for, or knowledge of, AI. The article below is an extract from the full report. This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review's editorial staff.

How artificial intelligence can fight cyberattacks


From an organisational perspective, apart from the loss of critical information, financial losses, reputational damage, and disruption to operations, it is in most cases impossible to identify the intensity of a cyberattack, and the amount of data actually compromised often remains unknown. This was witnessed recently when hackers launched attacks on multiple Indian pharmaceutical companies; to date there is no visibility into the degree of the attacks or the nature of the data that was compromised. Cybersecurity is a critical concern for all organisations today. Unfortunately, most businesses are not adequately equipped to handle these complex cyber threats simply because they continue to rely on traditional techniques. They do not possess the high-end tools required to quickly identify and recover from threats, tools which, if adopted, can go a long way towards ensuring cybersecurity.

A Hamiltonian Monte Carlo Model for Imputation and Augmentation of Healthcare Data (Machine Learning)

Missing values exist in nearly all clinical studies because data for a variable or question are not collected or not available. Inadequate handling of missing values can lead to biased results and loss of statistical power in analysis. Existing models usually do not consider privacy concerns or do not utilise the inherent correlations across multiple features to impute the missing values. In healthcare applications, we are usually confronted with high-dimensional and sometimes small-sample-size datasets that need more effective augmentation or imputation techniques. Moreover, imputation and augmentation are traditionally conducted separately, yet imputing missing values and augmenting data together can significantly improve generalisation and avoid bias in machine learning models. This work proposes a Bayesian approach to imputing missing values and creating augmented samples in high-dimensional healthcare data. We propose folded Hamiltonian Monte Carlo (F-HMC) with Bayesian inference as a more practical approach to processing cross-dimensional relations, applying a random walk and Hamiltonian dynamics to adapt the posterior distribution and generate large-scale samples. The proposed method is applied to a cancer symptom assessment dataset and confirmed to enrich the quality of the data in terms of precision, accuracy, recall, F1 score, and a propensity metric.
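For background, the sampler underlying this work is Hamiltonian Monte Carlo. The sketch below is a minimal 1-D HMC loop drawing from a standard-normal posterior, as one might for a single missing value; it is an illustration of plain HMC only, not the authors' F-HMC (the folding, random walk, and cross-dimensional handling are omitted), and all names and settings are ours.

```python
import math
import random

def hmc_sample(logp, logp_grad, init, n_samples, step=0.1, n_leap=20, seed=0):
    """Minimal 1-D Hamiltonian Monte Carlo: simulate Hamiltonian
    dynamics with leapfrog steps, then apply a Metropolis correction."""
    rng = random.Random(seed)
    q = init
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)            # resample momentum
        q_new, p_new = q, p
        # Leapfrog integration: half momentum step, full position/momentum
        # steps, closing half momentum step.
        p_new += 0.5 * step * logp_grad(q_new)
        for _ in range(n_leap - 1):
            q_new += step * p_new
            p_new += step * logp_grad(q_new)
        q_new += step * p_new
        p_new += 0.5 * step * logp_grad(q_new)
        # Accept or reject based on the change in the Hamiltonian.
        h_old = -logp(q) + 0.5 * p * p
        h_new = -logp(q_new) + 0.5 * p_new * p_new
        if math.log(rng.random() + 1e-300) < h_old - h_new:
            q = q_new
        samples.append(q)
    return samples

# Target posterior: standard normal, log p(q) = -q^2/2, gradient -q.
draws = hmc_sample(lambda q: -0.5 * q * q, lambda q: -q, init=0.0, n_samples=2000)
mean = sum(draws) / len(draws)
```

The samples should concentrate around the posterior mean (here 0) with unit variance; F-HMC extends this basic mechanism to exploit correlations across features.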

Ancestry company uses deepfakes to bring old photos of your great grandma to life


While some companies are fighting nefarious uses of deepfakes, others are embracing the technology for more playful reasons. MyHeritage, a family ancestry company that offers DNA testing much like 23andMe, has unveiled a new AI-powered tool called "Deep Nostalgia." The technology takes your old photos and animates the people in them, producing a full-fledged moving picture, a bit like the iPhone's Live Photos. "With our new Deep Nostalgia, you can see how a person from an old photo could have moved and looked if they were captured on video," the company says. To create this completely automated tool, MyHeritage partnered with a company called D-ID, which has written an algorithm that creates these animated videos out of old images.

Securing Healthcare AI with Confidential Computing


The healthcare and pharmaceutical industry has found itself in the spotlight as all eyes turn to it in the race to find and develop a treatment in the fight against COVID-19. With the brightest minds in the healthcare and life sciences industries working together across international boundaries and sharing research data and findings, there has been increased pressure on medical researchers to find the answers that everyone is looking for in the current crisis. Not all of this research can be done manually by scientists and researchers, so we have seen a dramatic uptick in the application of Artificial Intelligence (AI) and Machine Learning (ML) techniques, which enable researchers to press "fast-forward" on their ability to analyse data, identify trends and anomalies, and deliver meaningful results that can then be acted upon. As a result of the vast amount of data now being generated, collated, processed, and stored, public attention has focused on how this data is to be secured, in line with expanding international privacy laws and regulatory requirements. Keeping the reams of personal healthcare records and the swathes of Intellectual Property (IP) contained within these AI workflows secure is of paramount importance, and this can now be achieved efficiently and effectively with the rise of a new technology: Confidential Computing.

Secure-UCB: Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification (Artificial Intelligence)

This paper studies bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker can observe both the selected actions and their corresponding rewards, and can contaminate the rewards with additive noise. We show that \emph{any} bandit algorithm with regret $O(\log T)$ can be forced to suffer a regret $\Omega(T)$ with an expected amount of contamination $O(\log T)$. This amount of contamination is also necessary, as we prove that there exists an $O(\log T)$ regret bandit algorithm, specifically the classical UCB, that requires $\Omega(\log T)$ amount of contamination to suffer regret $\Omega(T)$. To combat such poisoning attacks, our second main contribution is to propose a novel algorithm, Secure-UCB, which uses limited \emph{verification} to access a limited number of uncontaminated rewards. We show that with $O(\log T)$ expected number of verifications, Secure-UCB can restore the order-optimal $O(\log T)$ regret \emph{irrespective of the amount of contamination} used by the attacker. Finally, we prove that for any bandit algorithm, this number of verifications $O(\log T)$ is necessary to recover the order-optimal regret. We can then conclude that Secure-UCB is order-optimal in terms of both the expected regret and the expected number of verifications, and can save stochastic bandits from any data poisoning attack.
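For context, the classical UCB algorithm referenced in the abstract pulls the arm maximising an empirical mean plus a confidence bonus of the form sqrt(2 ln t / n_i). Below is a minimal UCB1 sketch on Bernoulli arms; it illustrates only the unsecured baseline that the attacker targets, not the paper's Secure-UCB (the verification mechanism is omitted), and all names and parameters are ours.

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """Classical UCB1: pull each arm once, then repeatedly pull the arm
    with the highest upper confidence bound mean_i + sqrt(2 ln t / n_i)."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k     # n_i: number of pulls of arm i
    means = [0.0] * k    # empirical mean reward of arm i
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: pull each arm once
        else:
            arm = max(range(k),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return counts, means

# Two Bernoulli arms with true means 0.3 and 0.7: over 2000 rounds,
# UCB1 should concentrate its pulls on the better arm.
counts, means = ucb1([lambda rng: float(rng.random() < 0.3),
                      lambda rng: float(rng.random() < 0.7)], horizon=2000)
```

The paper's attack works precisely because these confidence bounds trust every observed reward; Secure-UCB restores the guarantee by verifying a small number of rewards.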

Measuring Utility and Privacy of Synthetic Genomic Data (Artificial Intelligence)

Genomic data provides researchers with an invaluable source of information to advance progress in biomedical research, personalized medicine, and drug development. At the same time, however, this data is extremely sensitive, which makes data sharing, and consequently availability, problematic if not outright impossible. As a result, organizations have begun to experiment with sharing synthetic data, which should mirror the real data's salient characteristics, without exposing it. In this paper, we provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data. First, we assess the performance of the synthetic data on a number of common tasks, such as allele and population statistics as well as linkage disequilibrium and principal component analysis. Then, we study the susceptibility of the data to membership inference attacks, i.e., inferring whether a target record was part of the data used to train the model producing the synthetic dataset. Overall, there is no single approach for generating synthetic genomic data that performs well across the board. We show how the size and the nature of the training dataset matter, especially in the case of generative models. While some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Our measurement framework can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild, and will serve as a benchmark tool for researchers and practitioners in the future.
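One simple form of the membership inference attack studied here is a distance-to-closest-record test: if a target genotype lies unusually close to some synthetic record, the attacker guesses it was in the training set. The sketch below is a toy illustration using Hamming distance over allele counts; it is not the paper's evaluation framework, and the data and threshold are invented.

```python
def hamming(a, b):
    """Number of SNP positions at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def infer_membership(target, synthetic, threshold):
    """Guess 'member' if the target is within `threshold` of any
    synthetic record -- i.e. the generator leaked something close to it."""
    return min(hamming(target, s) for s in synthetic) <= threshold

# Toy genotypes (0/1/2 allele counts at six SNPs). A training record
# that reappears almost verbatim in the synthetic output is flagged;
# an unrelated genotype is not.
synthetic = [(0, 1, 2, 0, 1, 1), (2, 2, 0, 1, 0, 1)]
member = (0, 1, 2, 0, 1, 0)      # differs from a synthetic record at one site
non_member = (2, 0, 1, 2, 2, 0)  # far from every synthetic record
```

Real attacks use shadow models or likelihood ratios rather than a fixed distance threshold, but the underlying signal, synthetic records sitting too close to real training points, is the same one the paper measures.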

Differentially Private Federated Learning for Cancer Prediction (Machine Learning)

Since 2014, the NIH funded iDASH (integrating Data for Analysis, Anonymization, SHaring) National Center for Biomedical Computing has hosted yearly competitions on the topic of private computing for genomic data. For one track of the 2020 iteration of this competition, participants were challenged to produce an approach to federated learning (FL) training of genomic cancer prediction models using differential privacy (DP), with submissions ranked according to held-out test accuracy for a given set of DP budgets. More precisely, in this track, we are tasked with training a supervised model for the prediction of breast cancer occurrence from genomic data split between two virtual centers while ensuring data privacy with respect to model transfer via DP. In this article, we present our 3rd place submission to this competition. During the competition, we encountered two main challenges discussed in this article: i) ensuring correctness of the privacy budget evaluation and ii) achieving an acceptable trade-off between prediction performance and privacy budget.
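A standard building block for such DP pipelines is the Gaussian mechanism, which perturbs each shared statistic with noise calibrated to the privacy budget. The sketch below uses the classic analytic calibration (valid for epsilon < 1); it is a generic illustration, not the competition submission, and the per-centre statistics and parameters are invented.

```python
import math
import random

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Add Gaussian noise calibrated to (epsilon, delta)-DP for a query
    with the given L2 sensitivity (classic bound, valid for epsilon < 1)."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + rng.gauss(0.0, sigma)

def private_average(local_stats, sensitivity, epsilon, delta, seed=0):
    """Each virtual centre perturbs its local statistic before sharing,
    so only noised values ever cross institutional boundaries."""
    rng = random.Random(seed)
    noised = [gaussian_mechanism(s, sensitivity, epsilon, delta, rng)
              for s in local_stats]
    return sum(noised) / len(noised)

# Two centres share a noised local statistic (e.g. a model coefficient);
# the aggregator only ever sees the privatised values.
est = private_average([0.42, 0.47], sensitivity=0.01, epsilon=0.5, delta=1e-5)
```

The trade-off the authors describe is visible here: a smaller epsilon inflates sigma, so the aggregated estimate drifts further from the true average, degrading prediction performance.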

A Large-Scale Database for Graph Representation Learning (Artificial Intelligence)

With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify 3 critical components important for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance. The unprecedented scale and diversity of MalNet offers exciting opportunities to advance the frontiers of graph representation learning---enabling new discoveries and research into imbalanced classification, explainability, and the impact of class hardness. The database is publicly available at

AI OCR: Enhancing Business Operations


Data entry may sound like a long and boring task, but it is one of the most significant business operations. Data security is of the highest concern, as businesses are built on their data. In order to secure and manage that data, organizations are required to enter it all into their online systems. However, entering each customer's data from its physical form into an online system can take hours of manual labor. OCR technology was developed to enhance this data entry procedure.