Collaborating Authors



Communications of the ACM

We present SoundWatch, a smartwatch-based deep learning application to sense, classify, and provide feedback about sounds occurring in the environment.

Responsible Data Management

Communications of the ACM

Incorporating ethics and legal compliance into data-driven algorithmic systems has been attracting significant attention from the computing research community, most notably under the umbrella of fair8 and interpretable16 machine learning. While important, much of this work has been limited in scope to the "last mile" of data analysis and has disregarded both the system's design, development, and use life cycle (What are we automating and why? Is the system working as intended? Are there any unforeseen consequences post-deployment?) and the data life cycle (Where did the data come from? How long is it valid and appropriate?). In this article, we argue two points. First, the decisions we make during data collection and preparation profoundly impact the robustness, fairness, and interpretability of the systems we build. Second, our responsibility for the operation of these systems does not stop when they are deployed. To make our discussion concrete, consider the use of predictive analytics in hiring. Automated hiring systems are seeing ever broader use and are as varied as the hiring practices themselves, ranging from resume screeners that claim to identify promising applicantsa to video and voice analysis tools that facilitate the interview processb and game-based assessments that promise to surface personality traits indicative of future success.c Bogen and Rieke5 describe the hiring process from the employer's point of view as a series of decisions that forms a funnel, with stages corresponding to sourcing, screening, interviewing, and selection. The hiring funnel is an example of an automated decision system--a data-driven, algorithm-assisted process that culminates in job offers to some candidates and rejections to others. The popularity of automated hiring systems is due in no small part to our collective quest for efficiency.

US Companies Must Deal with EU AI law, Like It or Not


Don't look now, but using Google Analytics to track your website's audience might be illegal. That's the view of a court in Austria, which in January found that Google's data product was in breach of the European Union's General Data Protection Regulation (GDPR) as it was not doing enough to make sure data transferred from the EU to the company's servers in the US was protected (from, say, US intelligence agencies). Well for those working in AI and biotech, it matters, especially to those working outside of Europe with a view to expansion there. For a start, this is a major precedent that threatens to upend the way many tech companies work, since the tech sector relies heavily on the safe use and transfer of large quantities of data. Whether you use Google Analytics is neither here nor there; the case has shown that Privacy Shield -- the EU-US framework that governs the transfer of personal information in compliance with GDPR -- may not be compliant with European law after all.

Market Segmentation in the Emoji Era

Communications of the ACM

Ishaan and Elizabeth, both graduate students in business, are attending a marketing strategy lecture at a business school in the Northeast. While learning about the principles of market segmentation, Ishaan texts "outdated" followed by three thinking--face emojis to Elizabeth. He wonders how demographic-, geographic-, or psychographic-based segmentation--the topic of the lecture--can help his family's franchise restaurant deal with the hundreds of sometimes-not-so-positive online reviews and social media posts. Meanwhile, Elizabeth hopes that the fast-food restaurant where she ordered her lunch understands that she now belongs to the segment of'extremely displeased' customers. Earlier, she used the restaurant's new app to order a burrito without cheese and sour cream, only to discover that the meal included both offending ingredients. Her lunch went straight into the trash can and she angrily tweeted her disappointment to the restaurant. This simple vignette illustrates an important point. Organizations of every size are challenged with capitalizing on enormous amounts of unstructured organizational data--for instance, from social media posts--particularly for applications such as market segmentation. The purpose of this article is to give the reader an idea of the challenges and opportunities faced by businesses using market segmentation, including the impacts of big data. Our research will demonstrate what market segmentation might look like in the near future, as we also offer a promising approach to implementing market segmentation using unstructured data.

The New Intelligence Game


The relevance of the video is that the browser identified the application being used by the IAI as Google Earth and, according to the OSC 2006 report, the Arabic-language caption reads Islamic Army in Iraq/The Military Engineering Unit – Preparations for Rocket Attack, the video was recorded in 5/1/2006, we provide, in Appendix A, a reproduction of the screenshot picture made available in the OSC report. Now, prior to the release of this video demonstration of the use of Google Earth to plan attacks, in accordance with the OSC 2006 report, in the OSC-monitored online forums, discussions took place on the use of Google Earth as a GEOINT tool for terrorist planning. On August 5, 2005 the user "Al-Illiktrony" posted a message to the Islamic Renewal Organization forum titled A Gift for the Mujahidin, a Program To Enable You to Watch Cities of the World Via Satellite, in this post the author dedicated Google Earth to the mujahidin brothers and to Shaykh Muhammad al-Mas'ari, the post was replied in the forum by "Al-Mushtaq al-Jannah" warning that Google programs retain complete information about their users. This is a relevant issue, however, there are two caveats, given the amount of Google Earth users, it may be difficult for Google to flag a jihadist using the functionality in time to prevent an attack plan, one possible solution would be for Google to flag computers based on searched websites and locations, for instance to flag computers that visit certain critical sites, but this is a problem when landmarks are used, furthermore, and this is the second caveat, one may not use one's own computer to produce the search or even mask the IP address. On October 3, 2005, as described in the OSC 2006 report, in a reply to a posting by Saddam Al-Arab on the Baghdad al-Rashid forum requesting the identification of a roughly sketched map, "Almuhannad" posted a link to a site that provided a free download of Google Earth, suggesting that the satellite imagery from Google's service could help identify the sketch.

GraphDCA -- a Framework for Node Distribution Comparison in Real and Synthetic Graphs Artificial Intelligence

We argue that when comparing two graphs, the distribution of node structural features is more informative than global graph statistics which are often used in practice, especially to evaluate graph generative models. Thus, we present GraphDCA - a framework for evaluating similarity between graphs based on the alignment of their respective node representation sets. The sets are compared using a recently proposed method for comparing representation spaces, called Delaunay Component Analysis (DCA), which we extend to graph data. To evaluate our framework, we generate a benchmark dataset of graphs exhibiting different structural patterns and show, using three node structure feature extractors, that GraphDCA recognizes graphs with both similar and dissimilar local structure. We then apply our framework to evaluate three publicly available real-world graph datasets and demonstrate, using gradual edge perturbations, that GraphDCA satisfyingly captures gradually decreasing similarity, unlike global statistics. Finally, we use GraphDCA to evaluate two state-of-the-art graph generative models, NetGAN and CELL, and conclude that further improvements are needed for these models to adequately reproduce local structural features.

A Variational Edge Partition Model for Supervised Graph Representation Learning Machine Learning

Graph neural networks (GNNs), which propagate the node features through the edges and learn how to transform the aggregated features under label supervision, have achieved great success in supervised feature extraction for both node-level and graph-level classification tasks. However, GNNs typically treat the graph structure as given and ignore how the edges are formed. This paper introduces a graph generative process to model how the observed edges are generated by aggregating the node interactions over a set of overlapping node communities, each of which contributes to the edges via a logical OR mechanism. Based on this generative model, we partition each edge into the summation of multiple community-specific weighted edges and use them to define community-specific GNNs. A variational inference framework is proposed to jointly learn a GNN based inference network that partitions the edges into different communities, these community-specific GNNs, and a GNN based predictor that combines community-specific GNNs for the end classification task. Extensive evaluations on real-world graph datasets have verified the effectiveness of the proposed method in learning discriminative representations for both node-level and graph-level classification tasks.

Latent gaze information in highly dynamic decision-tasks Artificial Intelligence

Digitization is penetrating more and more areas of life. Tasks are increasingly being completed digitally, and are therefore not only fulfilled faster, more efficiently but also more purposefully and successfully. The rapid developments in the field of artificial intelligence in recent years have played a major role in this, as they brought up many helpful approaches to build on. At the same time, the eyes, their movements, and the meaning of these movements are being progressively researched. The combination of these developments has led to exciting approaches. In this dissertation, I present some of these approaches which I worked on during my Ph.D. First, I provide insight into the development of models that use artificial intelligence to connect eye movements with visual expertise. This is demonstrated for two domains or rather groups of people: athletes in decision-making actions and surgeons in arthroscopic procedures. The resulting models can be considered as digital diagnostic models for automatic expertise recognition. Furthermore, I show approaches that investigate the transferability of eye movement patterns to different expertise domains and subsequently, important aspects of techniques for generalization. Finally, I address the temporal detection of confusion based on eye movement data. The results suggest the use of the resulting model as a clock signal for possible digital assistance options in the training of young professionals. An interesting aspect of my research is that I was able to draw on very valuable data from DFB youth elite athletes as well as on long-standing experts in arthroscopy. In particular, the work with the DFB data attracted the interest of radio and print media, namely DeutschlandFunk Nova and SWR DasDing. All resulting articles presented here have been published in internationally renowned journals or at conferences.

The impact of feature importance methods on the interpretation of defect classifiers Artificial Intelligence

Abstract--Classifier specific (CS) and classifier agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence such interchangeable use of feature importance methods can lead to conclusion instabilities unless there is a strong agreement among different methods. Therefore, in this paper, we evaluate the agreement between the feature importance ranks associated with the studied classifiers through a case study of 18 software projects and six commonly used classifiers. We find that: 1) The computed feature importance ranks by CA and CS methods do not always strongly agree with each other. Such findings raise concerns about the stability of conclusions across replicated studies. We further observe that the commonly used defect datasets are rife with feature interactions and these feature interactions impact the computed feature importance ranks of the CS methods (not the CA methods). We demonstrate that removing these feature interactions, even with simple methods like CFS improves agreement between the computed feature importance ranks of CA and CS methods. In light of our findings, we provide guidelines for stakeholders and practitioners when performing model interpretation and directions for future research, e.g., future research is needed to investigate the impact of advanced feature interaction removal methods on computed feature importance ranks of different CS methods. We note, however, that a CS method is not always readily available for Defect classifiers are widely used by many large software corporations a given classifier. Defect classifiers are commonly and deep neural networks do not have a widely accepted CS interpreted to uncover insights to improve software quality. Therefore it is the feature importance ranks of different classifiers is pivotal that these generated insights are reliable. Such CA methods measure the contribution of each feature a feature importance method to compute a ranking of feature towards a classifier's predictions. These measure the contribution of each feature by effecting changes to feature importance ranks reflect the order in which the studied that particular feature in the dataset and observing its impact on features contribute to the predictive capability of the studied the outcome. The primary advantage of CA methods is that they classifier [14].

Transformers and the representation of biomedical background knowledge Artificial Intelligence

BioBERT and BioMegatron are Transformers models adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine - namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyse how the models behave with regard to biases and imbalances in the dataset.