Collaborating Authors


Creating Confidence Intervals for Machine Learning Classifiers


My name is Sebastian, and I am a machine learning and AI researcher with a strong passion for education. As Lead AI Educator at, I am excited about making AI & deep learning more accessible and teaching people how to utilize AI & deep learning at scale. I am also an Assistant Professor of Statistics at the University of Wisconsin-Madison and author of the bestselling book Python Machine Learning.

Precision Medicine in Stroke: Outcome Predictions Using AI


New and continuously improving treatment options such as thrombolysis and thrombectomy have revolutionized acute stroke treatment in recent years. Following modern rhythms, the next revolution might well be the strategic use of the steadily increasing amounts of patient-related data for generating models enabling individualized outcome predictions. Milestones have already been achieved in several health care domains, as big data and artificial intelligence have entered everyday life. The aim of this review is to synoptically illustrate and discuss how artificial intelligence approaches may help to compute single-patient predictions in stroke outcome research in the acute, subacute and chronic stage. We will present approaches considering demographic, clinical and electrophysiological data, as well as data originating from various imaging modalities and combinations thereof. We will outline their advantages, disadvantages, their potential pitfalls and the promises they hold with a special focus on a clinical audience.

An Enhanced Secure Deep Learning Algorithm for Fraud Detection in Wireless Communication


In today’s era of technology, especially in Internet commerce and banking, the number of transactions made with Mastercards has been increasing rapidly. The card has become a widely used instrument for Internet shopping. Such demand and the inflation rate have also caused considerable damage and an increase in fraud cases. It is necessary to stop fraudulent transactions because they affect financial conditions over time, and anomaly detection has important applications in fraud detection. This work proposes a novel framework that integrates Spark with a deep learning approach. It also implements several machine learning techniques for detecting fraudulent transactions, including random forest, SVM, logistic regression, decision tree, and KNN. A comparative analysis is carried out using various parameters. More than 96% accuracy was obtained on both the training and testing datasets. Existing systems such as Cardwatch and web service-based fraud detection need labelled data for both genuine and fraudulent transactions, and they cannot detect new kinds of fraud. The dataset used contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over 2 days, with 492 fraudulent transactions out of 284,807, which is 0.172% of all transactions.

Toward Reduction in False-Positive Thyroid Nodule Biopsies with a Deep Learning–based Risk Stratification System Using US Cine-Clip Images


The Cine-CNNTrans achieved an average AUC of 0.88 ± 0.10 for classifying benign versus malignant thyroid nodules. The Cine-CNNTrans showed higher AUC than the Static-2DCNN (P = .03). For aggregating framewise outputs into nodulewise scores, the Cine-CNNTrans tended toward higher AUC compared with the Cine-CNNAvePool (P = .17). Our system tended toward higher AUC than the Cine-Radiomics and the ACR TI-RADS level, though the difference did not achieve statistical significance (P = .16

Artificial Intelligence in Nephrology: How Can Artificial Intelligence Augment Nephrologists' Intelligence?


Background: Artificial intelligence (AI) now plays a critical role in almost every area of our daily lives and academic disciplines due to the growth of computing power, advances in methods and techniques, and the explosion of the amount of data; medicine is not an exception. Rather than replacing clinicians, AI is augmenting the intelligence of clinicians in diagnosis, prognosis, and treatment decisions. Summary: Kidney disease is a substantial medical and public health burden globally, with both acute kidney injury and chronic kidney disease bringing about high morbidity and mortality as well as a huge economic burden. Even though the existing research and applied works have made certain contributions to more accurate prediction and better understanding of histologic pathology, there is a lot more work to be done and problems to solve. Key Messages: AI applications of diagnostics and prognostics for high-prevalence and high-morbidity types of nephropathy in medical-resource-inadequate areas need special attention; high-volume and high-quality data need to be collected and prepared; a consensus on ethics and safety in the use of AI technologies needs to be built.

Newsletter #73 -- DeepMind's 600 task AI agent


The company has raised $100 million in round C funding with the aim of becoming the "GitHub of machine learning". Inflection -- is an AI-first company aiming to redefine human-computer interaction. It is led by LinkedIn and DeepMind co-founders and was referenced in our Newsletter #68. The company has now raised $225 million in venture funding to use AI to help humans "talk" to computers. Unlearn -- aims to accelerate clinical trials by using AI, digital twins, and novel statistical methods to "enable smaller control groups while maintaining power and generating evidence suitable for supporting regulatory decisions".

Agent-Based Modeling for Predicting Pedestrian Trajectories Around an Autonomous Vehicle

Journal of Artificial Intelligence Research

This paper addresses modeling and simulating pedestrian trajectories when interacting with an autonomous vehicle in a shared space. Most pedestrian–vehicle interaction models are not suitable for predicting individual trajectories. Data-driven models yield accurate predictions but lack generalizability to new scenarios, usually do not run in real time and produce results that are poorly explainable. Current expert models do not deal with the diversity of possible pedestrian interactions with the vehicle in a shared space and lack microscopic validation. We propose an expert pedestrian model that combines the social force model and a new decision model for anticipating pedestrian–vehicle interactions. The proposed model integrates different observed pedestrian behaviors, as well as the behaviors of the social groups of pedestrians, in diverse interaction scenarios with a car. We calibrate the model by fitting the parameter values on a training set. We validate the model and evaluate its predictive potential through qualitative and quantitative comparisons with ground truth trajectories. The proposed model reproduces observed behaviors that have not been replicated by the social force model and outperforms the social force model at predicting pedestrian behavior around the vehicle on the dataset used. The model generates explainable and real-time trajectory predictions. Additional evaluation on a new dataset shows that the model generalizes well to new scenarios and can be used for prediction embedded in an autonomous vehicle.

Text Analysis of Job Descriptions for Data Scientists, Data Engineers, Machine Learning Engineers and Data Analysts


Introduction

In the previous post, the intrepid Jesse Blum and I analyzed metadata from over 6,500 job descriptions for data roles in seven European countries. In this post, we’ll apply text analysis to those job postings to better understand the technologies and skills that employers are looking for in data scientists, data engineers, data analysts, and machine learning engineers. Our text analyses show that:

- Data analysts are expected to have skills in reporting, dashboarding, data analysis, and office suite software.
- Data scientists are expected to know more about data science, statistics, mathematics, and making predictions.
- Data engineers are expected to industrialize organizations’ cloud and data architecture and infrastructure.
- Machine learning engineers are expected to use artificial intelligence and deep learning frameworks such as TensorFlow and PyTorch.

The results of this analysis complement and extend the results we presented last time, showing that employers have distinct visions of the (mostly technical & software-related) skillsets that data analysts, data scientists, data engineers, and machine learning engineers should possess.

The Data

The data come from a web scraping program developed by Jesse and myself. Every 2 weeks, we scraped job advertisements from a major job portal website, extracting all jobs posted within the previous 2-week period for the following job titles: Data Engineer, Data Analyst, Data Scientist and Machine Learning Engineer, in the following countries: the United Kingdom, Ireland, Germany, France, the Netherlands, Belgium and Luxembourg. We started data collection in mid-August and finished at the end of December 2021, ending up with 6,590 scraped job descriptions. All the data and code used for this analysis are available on GitHub. Feedback welcome!

Results

Our dataset includes job descriptions for data roles across four languages (English, French, Dutch and German).
We wanted to see if there were any differences in word usage among the different roles (data scientist, data engineer, machine learning engineer and data analyst), and therefore conducted language-specific analyses to contrast and compare the roles according to the words used to describe the job openings.

Word Clouds

Our first set of analyses uses a great R function to create comparison clouds. This type of analysis allows us to compare the frequency of words across groups of documents, and highlight words that appear more in a given group versus the others. Jesse and I are more comfortable in English, French, and Dutch than German, so we limited our analysis to those three languages. However, there were far fewer Dutch job descriptions than for the other two, so the resulting Dutch comparison cloud was not particularly informative. Below, we focus on the English and French wordclouds and what they reveal about employers’ expectations for the different roles.

[Figure: French comparison word cloud]

[Figure: English comparison word cloud]

Overall, we found that there were clear differences between the roles in the language used in the job advertisements. Furthermore, these differences were largely consistent across the English and French language job ads. The following table summarizes the comparison:

Role: Data analysts
French job descriptions (N = 1,349): Expected to know about data analysis (analyse), reporting (reporting, tableau de bord), and data visualization (visualisation). Likely work more with stakeholders in the business (métier). In contrast to the English job description texts, data analysts are expected to know more about SQL (in English this word appeared more frequently in data engineering job descriptions).
English job descriptions (N = 3,869): Expected to have skills in reporting, dashboarding, data analysis and office suite. More interaction with other stakeholders throughout the larger organization. More emphasis on identifying insights (which need to be communicated to others in order to inform decision making).

Role: Data scientists
French job descriptions (N = 1,349): Relatively few unique skills. Expected to know data science and statistics (statistique), and to build models (modèle) and make predictions (prédiction).
English job descriptions (N = 3,869): Relatively few unique skills. Expected to know about data science, statistics, mathematics and making predictions.

Role: Data engineers
French job descriptions (N = 1,349): Greater expectation to work with cloud platforms (plateforme, cloud, Azure), big data technologies (Scala and Spark), data pipelines, ETL and data storage (stockage). Somewhat surprisingly, data engineers, compared to the other roles, are expected to work with agile methodology.
English job descriptions (N = 3,869): Greater expectation to work with cloud and data platforms, ETL (data transfer & storage) and data pipelines, databases, data architecture and infrastructure, Spark and SQL. Essentially, the technologies and databases that go along with storing and transferring data from one place to another are under the responsibility of the data engineer.

Role: Machine learning engineers
French job descriptions (N = 1,349): Greater expectation to use machine learning (apprentissage automatique), artificial intelligence (intelligence artificielle), and tools for deep learning algorithms / neural networks (réseau de neurones artificiels) like TensorFlow and PyTorch.
English job descriptions (N = 3,869): Greater expectation to use artificial intelligence and deep learning frameworks such as TensorFlow and PyTorch. Greater expectation to know more about software engineering and computer science. Interestingly, the text of the English job ads reveals that machine learning engineers are being asked to work on computer vision problems.

Some other observations that we found noteworthy:

There are strikingly few terms that are unique to the data scientist role, suggesting large overlaps with the other profiles.
As recently as a couple of years ago, the roles of data engineer and machine learning engineer were much less prevalent and many of the responsibilities currently assigned to these roles fell under the purview of data scientists. With the growth of other data roles and a resulting divvying up of data work, it seems as though organizations are not entirely clear as to what exactly the unique characteristics of data scientists are.

While the conclusions from the wordclouds were virtually identical across languages, there were some notable differences among the different roles between English and French. For example, the French machine learning engineer ads were more likely to include innovation than the English ones, perhaps suggesting that this work is taking place in R&D or innovation centers of larger companies. The French job descriptions for data engineers were more likely to mention agile methodology, and the French job descriptions for data analysts were more likely to mention SQL (in English, this technology was more prevalent for the data engineer job ads).

Finally, it was interesting to note that many of the terms used in French job descriptions are actually English words. For example, cloud, reporting, and deep learning could all be translated into French, but they’re usually left in English. Other jargon surrounding data professions, however, has well-established French equivalents. For instance, tableau de bord is the French equivalent of dashboard, intelligence artificielle is the French equivalent of artificial intelligence, and apprentissage automatique is the French equivalent of machine learning. So if you’re trying to understand the tech industry in France, it’s perhaps worth brushing up on your English vocabulary!

Using Skills-ML to Extract Skills from Job Ads

The Skills ML library is a great tool for extracting high-level skills from job descriptions.
The Skills ML library uses a dictionary-based word search approach to scan through text and identify skills from the O*NET skill ontology, allowing for the extraction of important high-level skills mapped by labor market experts. This approach is more comprehensive than simply counting words (as we did with the comparison clouds above), and it takes into account the fact that some words are synonyms or represent the same skill or technology (e.g. “database”, “data warehouse”, “data lake”, etc. can be grouped under a higher-level term such as “data storage”). Because the O*NET skills are only available in English, this analysis was conducted only on the English-language job descriptions.

Most Common Skills

As the following figure shows, Python was the most common skill represented in the English-language job descriptions. Other top skills include R, programming, mathematics, Tableau, visualization, writing, Git, and physics. However, this analysis collapses all the skills across the four data roles. We saw in the wordcloud analysis above and in the previous analysis of job keywords that the desired skillsets can look quite different between the different data profiles.

Clustering Skills and Roles

In order to get a sense of how the extracted skills differed across the data roles, we made a cluster map using the Python Seaborn library. Specifically, we calculated the percentage of job ads per role that contained each skill, filtering on skills that appeared in more than 50 job ads. These percentages were converted to z-scores, such that higher numbers indicate that a given skill is mentioned more often for a given role compared to the others. This final matrix was then passed to the cluster map algorithm, which performs a simultaneous clustering of both the job roles and of the extracted skills. The results of this analysis showed that there are clear clusters of skillsets required for different types of data-related roles.
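The pipeline described above (dictionary-based skill matching, per-role percentages, then z-scores) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use the Skills-ML API, the term dictionary `SKILL_TERMS`, the helper names and the sample ads are all hypothetical, the z-scores are assumed to be computed per skill across roles with the population standard deviation, and the filter on skills appearing in more than 50 ads is omitted because the toy data is tiny.

```python
import re
from statistics import mean, pstdev

# Hypothetical mini-dictionary mapping surface terms to a higher-level skill.
SKILL_TERMS = {
    "database": "data storage",
    "data warehouse": "data storage",
    "data lake": "data storage",
    "python": "python",
    "tableau": "tableau",
}


def extract_skills(text):
    """Return the set of high-level skills whose terms occur in the text."""
    lowered = text.lower()
    found = set()
    for term, skill in SKILL_TERMS.items():
        # Whole-word match so that short terms do not match inside other words.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(skill)
    return found


def skill_percentages(ads):
    """ads: list of (role, text). Returns {role: {skill: percent of ads}}."""
    counts, totals = {}, {}
    for role, text in ads:
        totals[role] = totals.get(role, 0) + 1
        role_counts = counts.setdefault(role, {})
        for skill in extract_skills(text):
            role_counts[skill] = role_counts.get(skill, 0) + 1
    skills = sorted(set(SKILL_TERMS.values()))
    return {
        role: {s: 100.0 * counts[role].get(s, 0) / totals[role] for s in skills}
        for role in totals
    }


def z_scores_by_skill(percentages):
    """Standardize each skill across roles (assumed: population std dev)."""
    roles = list(percentages)
    skills = list(next(iter(percentages.values())))
    result = {role: {} for role in roles}
    for s in skills:
        values = [percentages[r][s] for r in roles]
        mu, sigma = mean(values), pstdev(values)
        for r in roles:
            result[r][s] = (percentages[r][s] - mu) / sigma if sigma else 0.0
    return result


# Made-up sample ads, for illustration only.
ads = [
    ("data analyst", "Build Tableau dashboards on top of our data warehouse."),
    ("data analyst", "Reporting in Tableau and Excel."),
    ("data engineer", "Maintain the data lake and database clusters with Python."),
    ("data engineer", "Design database schemas."),
]
pct = skill_percentages(ads)
z = z_scores_by_skill(pct)
```

A roles-by-skills matrix built from `z` is the kind of input that could then be handed to a clustering heatmap such as Seaborn's `clustermap`.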
In the clustering diagram, shades of red indicate a higher prevalence of a given skill for a given role compared to the others, while shades of blue indicate a lower prevalence of a given skill for a given role compared to the others. Along the horizontal axis, individual skills are clustered together in logical ways. For instance, at the right side of the chart, Microsoft Office is grouped together with Microsoft Excel and Google Analytics. On the vertical axis, roles cluster into three separate groups according to their required skills:

Data analysts are in their own cluster at the top of the graph, with skills that are most different from the other roles. In particular, job ads for data analysts are more likely to mention office-suite software (e.g. Microsoft Office & Excel, Google Analytics), data visualization / dashboarding tools (e.g. Tableau), and sales management & tracking tools (e.g. Salesforce). Job ads for data analysts are less likely to mention programming tools such as Git, Python, programming languages, etc.

Data engineers are grouped in an overall cluster with data scientists and machine learning engineers, but have a separate branch from the other two. The job ads for data engineers were comparatively more likely to mention data tools (Oracle, NoSQL, MySQL, MongoDB, PostgreSQL, and Apache Spark). This suggests that data engineers are expected to play greater roles in the development and maintenance of an organization’s data infrastructure, compared to the other three roles.

Data scientists and machine learning engineers are placed together in the same cluster at the bottom of the graph. These two roles overlap in terms of computer science skills like programming, Python, and Git, and in domain knowledge in scientific fields such as biology and physics. Furthermore, job ads for these two roles are also less likely to require office suite software capabilities (e.g. Excel), visualization (e.g. Tableau), and business tools (e.g. Salesforce) that are most characteristic of data analysts. However, there are some differences between data scientists and machine learning engineers. Data scientists’ job ads have higher prevalence of mathematics, chemistry and R, while job ads for machine learning engineers have higher prevalence of operating systems (e.g. Unix and Linux), and programming languages (e.g. JavaScript and C).

The Added Value of Analyzing Job Description Texts

Overall, the above analysis serves as a useful extension of the metadata analysis we described in our previous post. Here, we first presented comparison clouds showing the relative frequency of words that were unique to a given role compared to the others. We made separate word clouds for the texts of the English and French job ads, respectively, and found that the main conclusions from these visualizations were the same. Interesting findings from this analysis included:

Data analysts are expected to work with dashboarding, data analysis and Office tools like Excel. Of all of the profiles, job descriptions for data analysts were more likely to mention contact with the business, interacting with stakeholders and generating and communicating insights.

Data scientists, in contrast, had relatively few unique words in their job descriptions. Compared to the other roles, they are expected to know about statistics, mathematics and making predictions from models. Our sense was that, given the recent growth of other data roles such as data engineers and machine learning engineers, there is some degree of ambiguity regarding the distinct characteristics that data scientists should have compared to the other roles.

The job ads for data engineers had a long list of data storage and transfer technologies that were unique to this role. Data engineers are expected to master many different types of databases and cloud platforms in order to move data around and store it in a proper way.
Finally, job ads for machine learning engineers were more likely to contain mentions of artificial intelligence and deep learning frameworks like PyTorch and TensorFlow, applied to domains such as computer vision.

We also extracted skills from the English language job descriptions using the O*NET skill classification. As in our previous analysis of skill keywords, Python was the most frequently appearing skill. We then made a clustermap to see how the extracted skills differed across the roles. In this analysis, the data analyst role had the least in common with the others. Data analysts in particular were more likely to use office tools (Excel, Google Analytics), visualization tools (e.g. Tableau) and business software (e.g. Salesforce), and less likely to use programming tools and languages (e.g. Git and Python). Data engineers also had their own specialties, being particularly likely to work with a wider variety of data storage, big data, and query technologies (e.g. many flavors of SQL, Apache Spark, etc.). This analysis shows that data analysts and data engineers have very different skillsets, with data analysts being more focused on office and business software, and data engineers being more focused on programming and databases. This highlights the importance of having both roles on a team in order to have a well-rounded skillset, and the unlikelihood of one person being equally good at both skillsets (the long-sought-after but rarely found “unicorn” profile).

The End

This is the final post that we’ll make of the analysis of these job description data. All of the data and code for these analyses are available on GitHub, and we encourage you to explore them further! This exercise was very meta for us, challenging ourselves across data analysis, data science, and data engineering.
Both the metadata analysis presented previously and the current text analysis helped us clarify our thinking about the market for data profiles in Europe, and we hope to have expanded your understanding of the data professions and the skills that unite and differentiate them. The job market is evolving quickly, as are the technologies and tools that data professionals are being asked to master. Our analysis of European job descriptions offers a snapshot of the current job market, and we are excited to see what the future brings as European companies’ and institutions’ data efforts mature and as the market continues to evolve!
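The comparison-cloud idea from the Word Clouds section of the post above (highlighting words that appear more in one group of documents than in the others) can be illustrated with a small Python sketch. The post used an R function for this; the code below is not that function but one plausible way to compute the underlying scores, assuming each word's rate in a group is compared with its average rate across all groups. The function name `comparison_scores` and the token lists are hypothetical.

```python
from collections import Counter


def comparison_scores(groups):
    """groups: {group name: list of tokens}.

    Returns {word: (group, deviation)}, where group is the one in which the
    word has its highest relative frequency and deviation is that frequency
    minus the word's average relative frequency across all groups.
    """
    rates = {}
    for name, tokens in groups.items():
        counts = Counter(tokens)
        total = sum(counts.values()) or 1  # avoid division by zero
        rates[name] = {word: c / total for word, c in counts.items()}

    vocabulary = set()
    for group_rates in rates.values():
        vocabulary.update(group_rates)

    result = {}
    for word in vocabulary:
        per_group = {g: rates[g].get(word, 0.0) for g in rates}
        average = sum(per_group.values()) / len(per_group)
        best = max(per_group, key=per_group.get)
        result[word] = (best, per_group[best] - average)
    return result


# Made-up token lists standing in for the job ads of two roles.
groups = {
    "data analyst": "reporting dashboard excel sql reporting".split(),
    "data engineer": "spark sql cloud pipeline cloud".split(),
}
scores = comparison_scores(groups)
```

In a comparison cloud, a word would be drawn in the region of its group with a size that grows with its deviation, so a word used equally by every group (here `sql`) gets a deviation of zero and would not stand out.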

The next step in deep learning-guided clinical trials - Nature Cardiovascular Research


A combined imaging–clinical risk prediction model including the use of deep learning to predict sudden cardiac death (SCD) seems promising in patients with ischemic cardiomyopathy. A deep learning model could potentially recommend precision implantable cardioverter–defibrillator (ICD) implantation, leading to a personalized approach to the primary prevention of SCD. Therefore, an urgent need exists to design deep-learning-guided clinical trials. A potential example could be randomizing patients to ICD implantation versus conventional therapy based either on deep learning model prediction or on traditional ICD indications, which are primarily based on left ventricular ejection fraction (LVEF) and New York Heart Association (NYHA) functional class (from the MADIT‐II and SCD‐HeFT trials).

Deep learning algorithm shows accuracy in detecting glaucoma on fundus photographs


Automated deep learning analysis of fundus photographs showed high diagnostic accuracy in determining primary open-angle glaucoma, with increased ability to detect glaucoma earlier than human readers. A deep learning (DL) algorithm was trained, validated and tested on the fundus stereophotographs of participants enrolled in the Ocular Hypertension Treatment Study (OHTS), a randomized clinical trial evaluating the safety and efficacy of IOP-lowering medications in preventing progression from ocular hypertension to primary open-angle glaucoma (POAG). Assessment of optic disc and visual field changes in the OHTS was performed by two reading centers and a masked committee of glaucoma specialists, "a demanding, laborious and complicated task," according to the authors. The OHTS data set consisted of fundus photographs from 1,636 participants, of which 1,147 were included in the training set, 167 in the validation set and 322 in the test set. The DL model detected conversion to POAG with high diagnostic accuracy, suggesting that artificial intelligence can offer a reliable tool to automate the determination of glaucoma for clinical trial management, simplifying the process of human interpretation and, possibly, making it more standardized, objective and accurate.