Collaborating Authors: Ray, Soumya


Optimizing Drug Design by Merging Generative AI With Active Learning Frameworks

arXiv.org Artificial Intelligence

Traditional drug discovery programs are being transformed by the advent of machine learning methods. Among these, Generative AI methods (GM) have gained attention due to their ability to design new molecules and enhance specific properties of existing ones. However, current GM methods have limitations, such as low affinity towards the target, unknown ADME/PK properties, or a lack of synthetic tractability. To improve the applicability domain of GM methods, we have developed a workflow based on a variational autoencoder coupled with active learning steps. Our GM workflow iteratively learns from molecular metrics, including drug-likeness, synthesizability, similarity, and docking scores. In addition, we included a hierarchical set of criteria based on advanced molecular modeling simulations during a final selection step. We tested our GM workflow on two model systems, CDK2 and KRAS. In both cases, our model generated chemically viable molecules with high predicted affinity toward the targets. In particular, the proportion of high-affinity molecules generated by our GM workflow was significantly greater than that in the training data. Notably, we also uncovered novel scaffolds significantly dissimilar to those known for each target. These results highlight the potential of our GM workflow to explore novel chemical space for specific targets, thereby opening up new possibilities for drug discovery endeavors.
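
The iterative generate-score-select loop described above can be sketched in a few lines. The snippet below is a minimal schematic only, assuming placeholder stand-ins for the VAE sampler, the four scoring metrics (drug-likeness, synthesizability, similarity, docking), and the fine-tuning step, none of which are specified here; in the actual workflow, a final selection step would additionally apply the hierarchical molecular-modeling criteria.

```python
# Minimal schematic of the generate-score-select-retrain loop; all components
# are random placeholder stand-ins, not the actual VAE or scoring functions.
import random

random.seed(0)

def generate_candidates(state, n=64):
    """Stand-in for sampling candidate SMILES strings from the VAE decoder."""
    return [f"candidate_{state['round']}_{i}" for i in range(n)]

def score_molecule(smiles):
    """Stand-in composite of drug-likeness, synthesizability, similarity to
    known actives, and docking score, each pretended to lie in [0, 1]."""
    return sum(random.random() for _ in range(4)) / 4.0

def fine_tune(state, selected):
    """Stand-in for fine-tuning the generator on the highest-scoring designs."""
    state["round"] += 1
    state["selected"].extend(selected)
    return state

state = {"round": 0, "selected": []}
for _ in range(5):                                  # active-learning iterations
    pool = generate_candidates(state)
    ranked = sorted(pool, key=score_molecule, reverse=True)
    state = fine_tune(state, ranked[:8])            # acquisition: keep the top designs

print(f"{len(state['selected'])} molecules selected over {state['round']} rounds")
```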


Beyond Fair Pay: Ethical Implications of NLP Crowdsourcing

arXiv.org Artificial Intelligence

The use of crowdworkers in NLP research is growing rapidly, in tandem with the exponential increase in research production in machine learning and AI. Ethical discussion regarding the use of crowdworkers within the NLP research community is typically confined in scope to issues related to labor conditions such as fair pay. We draw attention to the lack of ethical considerations related to the various tasks performed by workers, including labeling, evaluation, and production. We find that the Final Rule, the common ethical framework used by researchers, did not anticipate the use of online crowdsourcing platforms for data collection, resulting in gaps between the spirit and practice of human-subjects ethics in NLP research. We enumerate common scenarios where crowdworkers performing NLP tasks are at risk of harm. We thus recommend that researchers evaluate these risks by considering the three ethical principles set out in the Belmont Report. We also clarify some common misconceptions regarding the Institutional Review Board (IRB) application process. We hope this paper will serve to reopen the discussion within our community regarding the ethical use of crowdworkers.


Reactive Supervision: A New Method for Collecting Sarcasm Data

arXiv.org Artificial Intelligence

Sarcasm detection is an important task in affective computing, requiring large amounts of labeled data. We introduce reactive supervision, a novel data collection method that utilizes the dynamics of online conversations to overcome the limitations of existing data collection techniques. We use the new method to create and release a first-of-its-kind large dataset of tweets with sarcasm perspective labels and new contextual features. The dataset is expected to advance sarcasm detection research. Our method can be adapted to other affective computing domains, thus opening up new research opportunities.


Gesture Annotation With a Visual Search Engine for Multimodal Communication Research

AAAI Conferences

Human communication is multimodal and includes elements such as gesture and facial expression along with spoken language. Modern technology makes it feasible to capture all such aspects of communication in natural settings. As a result, similar to fields such as genetics, astronomy and neuroscience, scholars in areas such as linguistics and communication studies are on the verge of a data-driven revolution in their fields. These new approaches require analytical support from machine learning and artificial intelligence to develop tools to help process the vast data repositories. The Distributed Little Red Hen Lab project is an international team of interdisciplinary researchers building a large-scale infrastructure for data-driven multimodal communications research. In this paper, we describe a machine learning system developed to automatically annotate a large database of television program videos as part of this project. The annotations mark regions where people or speakers are on screen along with body part motions including head, hand and shoulder motion. We also annotate a specific class of gestures known as timeline gestures. An existing gesture annotation tool, ELAN, can be used with these annotations to quickly locate gestures of interest. Finally, we provide an update mechanism for the system based on human feedback. We empirically evaluate the accuracy of the system as well as present data from pilot human studies to show its effectiveness at aiding gesture scholars in their work.


Automated Volumetric Intravascular Plaque Classification Using Optical Coherence Tomography

AI Magazine

An estimated 17.5 million people died from a cardiovascular disease in 2012, representing 31 percent of all global deaths. Most acute coronary events result from rupture of the protective fibrous cap overlying an atherosclerotic plaque. The task of early identification of plaque types that can potentially rupture is, therefore, of great importance. The state-of-the-art approach to imaging blood vessels is intravascular optical coherence tomography (IVOCT). However, currently, this is an offline approach where the images are first collected and then manually analyzed one image at a time to identify regions at risk of thrombosis. This process is extremely laborious, time-consuming, and prone to human error. We are building a system that, when complete, will provide interactive 3D visualization of a blood vessel as an IVOCT scan is in progress. The visualization will highlight different plaque types and enable quick identification of regions at risk for thrombosis. In this paper, we describe our approach, focusing on machine learning methods that are a key enabling technology. Our empirical results using real OCT data show that our approach can identify different plaque types efficiently with high accuracy across multiple patients.


Automated Volumetric Intravascular Plaque Classification Using Optical Coherence Tomography (OCT)

AAAI Conferences

An estimated 17.5 million people died from a cardiovascular disease in 2012, representing 31% of all global deaths. Most acute coronary events result from rupture of the protective fibrous cap overlying an atherosclerotic plaque. The task of early identification of plaque types that can potentially rupture is, therefore, of great importance. The state-of-the-art approach to imaging blood vessels is intravascular optical coherence tomography (IVOCT). However, currently, this is an offline approach where the images are first collected and then manually analyzed one frame at a time to identify regions at risk of thrombosis. This process is extremely laborious, time-consuming, and prone to human error. We are building a system that, when complete, will provide interactive 3D visualization of a blood vessel as an IVOCT scan is in progress. The visualization will highlight different plaque types and enable quick identification of regions at risk for thrombosis. In this paper, we describe our approach, focusing on machine learning methods that are a key enabling technology. Our empirical results using real OCT data show that our approach can identify different plaque types efficiently with high accuracy across multiple patients.
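
As a rough illustration of the cross-patient evaluation implied by "high accuracy across multiple patients", the sketch below runs a generic multiclass classifier with leave-one-patient-out splits. The per-frame features, the three plaque classes, and the random-forest classifier are illustrative assumptions, not the system's actual IVOCT feature extraction or model.

```python
# Illustrative leave-one-patient-out evaluation for multiclass plaque-type
# classification; features, labels, and patient groups are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_frames, n_features, n_patients = 600, 16, 6
X = rng.normal(size=(n_frames, n_features))
y = rng.integers(0, 3, size=n_frames)        # assumed classes, e.g. fibrous / lipid / calcified
groups = rng.integers(0, n_patients, size=n_frames)
X += y[:, None] * 0.75                       # shift class means so the toy task is learnable

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print("held-out accuracy per patient:", np.round(scores, 2))
```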


Learning Instance Concepts from Multiple-Instance Data with Bags as Distributions

AAAI Conferences

We analyze and evaluate a generative process for multiple-instance learning (MIL) in which bags are distributions over instances. We show that our generative process contains as special cases generative models explored in prior work, while excluding scenarios known to be hard for MIL. Further, under the mild assumption that every negative instance is observed with nonzero probability in some negative bag, we show that it is possible to learn concepts that accurately label instances from MI data in this setting. Finally, we show that standard supervised approaches can learn concepts with low area-under-ROC error from MI data in this setting. We validate this surprising result with experiments using several synthetic and real-world MI datasets that have been annotated with instance labels.
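
A rough way to see the claim about standard supervised approaches is the baseline sketched below: every instance inherits its bag's label, an off-the-shelf classifier is trained on these (noisy) labels, and instance-level AUC is measured against the true instance labels. The synthetic bag generator and the logistic-regression learner are illustrative assumptions, not the paper's experimental setup.

```python
# Baseline sketched above: give every instance its bag's label, train an
# ordinary supervised classifier, and measure instance-level AUC against the
# true (hidden) instance labels. All data here is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sample_bag(positive, n_inst=10):
    """Bags as distributions over instances: positive bags mix both kinds."""
    if positive:
        k = int(rng.integers(1, n_inst))                  # at least one positive instance
        pos = rng.normal(loc=+2.0, size=(k, 2))
        neg = rng.normal(loc=-2.0, size=(n_inst - k, 2))
        return np.vstack([pos, neg]), np.array([1] * k + [0] * (n_inst - k))
    return rng.normal(loc=-2.0, size=(n_inst, 2)), np.zeros(n_inst, dtype=int)

X_parts, bag_label_parts, true_label_parts = [], [], []
for b in range(100):
    positive = b % 2 == 0
    Xb, yb = sample_bag(positive)
    X_parts.append(Xb)
    bag_label_parts.append(np.full(len(Xb), int(positive)))   # inherited, noisy labels
    true_label_parts.append(yb)

X = np.vstack(X_parts)
y_bag = np.concatenate(bag_label_parts)
y_true = np.concatenate(true_label_parts)

clf = LogisticRegression().fit(X, y_bag)                      # standard supervised learner
print("instance-level AUC:", round(roc_auc_score(y_true, clf.predict_proba(X)[:, 1]), 3))
```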


Detection and Prediction of Adverse and Anomalous Events in Medical Robots

AAAI Conferences

Adverse and anomalous (A&A) events are a serious concern in medical robots. We describe a system that can rapidly detect such events and predict their occurrence. As part of this system, we describe the simulation, data collection, and user interface tools we built for a robot used for small animal biopsies. The data we collect consists of both the hardware state of the robot and variables in the software controller. We use this data to train dynamic Bayesian network models of the joint hardware-software state-space dynamics of the robot. Our empirical evaluation shows that (i) our models can accurately model normal behavior of the robot, (ii) they can rapidly detect anomalous behavior once it starts, (iii) they can accurately predict a future A&A event within a time window before it starts, and (iv) the use of additional software variables beyond the hardware state of the robot is important for detecting and predicting certain kinds of events.
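
The monitoring idea can be illustrated with a much simpler stand-in for the dynamic Bayesian network models: fit a one-step linear dynamics model to normal joint hardware/software state trajectories, then flag time steps whose prediction residual exceeds a threshold. The synthetic trajectories, the injected fault, and the threshold rule below are all illustrative assumptions.

```python
# Simplified stand-in for the DBN monitor: fit a one-step linear dynamics model
# to normal joint hardware/software state trajectories, then raise an alarm when
# the prediction residual exceeds a threshold. Everything below is synthetic.
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[0.90, 0.05, 0.00],
                   [0.00, 0.95, 0.02],
                   [0.00, 0.00, 0.90]])

def simulate(T, inject_fault=False):
    """Simulate a 3-dimensional state trajectory; optionally inject a fault."""
    x, traj = np.zeros(3), []
    for t in range(T):
        x = A_true @ x + rng.normal(scale=0.05, size=3)
        if inject_fault and t > T // 2:
            x = x + np.array([0.5, 0.0, 0.0])       # abrupt offset in one channel
        traj.append(x.copy())
    return np.array(traj)

normal = simulate(500)
A_hat, *_ = np.linalg.lstsq(normal[:-1], normal[1:], rcond=None)   # x_{t+1} ~= x_t @ A_hat
resid = np.linalg.norm(normal[1:] - normal[:-1] @ A_hat, axis=1)
threshold = resid.mean() + 4 * resid.std()

test = simulate(200, inject_fault=True)
test_resid = np.linalg.norm(test[1:] - test[:-1] @ A_hat, axis=1)
alarms = np.where(test_resid > threshold)[0] + 1
print("first alarm at step:", int(alarms[0]) if len(alarms) else "none")
```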


SMILe: Shuffled Multiple-Instance Learning

AAAI Conferences

Resampling techniques such as bagging are often used in supervised learning to produce more accurate classifiers. In this work, we show that multiple-instance learning admits a different form of resampling, which we call "shuffling." In shuffling, we resample instances in such a way that the resulting bags are likely to be correctly labeled. We show that resampling results in both a reduction of bag label noise and a propagation of additional informative constraints to a multiple-instance classifier. We empirically evaluate shuffling in the context of multiple-instance classification and multiple-instance active learning and show that the approach leads to significant improvements in accuracy.
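
The shuffling idea can be sketched on toy data: pool the instances of bags that share a label and draw new bags from that pool, so that resampled negative bags are always correctly labeled and resampled positive bags are correct whenever they happen to contain a positive instance. The toy bags and the uniform sampling below are illustrative; SMILe's actual resampling scheme and its noise analysis are more involved.

```python
# Toy sketch of shuffling: pool the instances of same-label bags and draw new
# bags from the pool. Bag contents are just tags; 'P' marks a positive instance.
import random

random.seed(0)

def shuffle_bags(bags, n_new, bag_size):
    """Pool the instances of the given bags and resample new bags of bag_size."""
    pool = [inst for bag in bags for inst in bag]
    return [random.sample(pool, bag_size) for _ in range(n_new)]

positive_bags = [["P", "n1", "n2"], ["n3", "P", "n4"], ["P", "P", "n5"]]
negative_bags = [["n6", "n7", "n8"], ["n9", "n10", "n11"]]

new_pos = shuffle_bags(positive_bags, n_new=5, bag_size=3)   # labeled positive
new_neg = shuffle_bags(negative_bags, n_new=5, bag_size=3)   # always correctly negative

correct = sum(any(inst == "P" for inst in bag) for bag in new_pos)
print(f"{correct}/5 shuffled positive bags actually contain a positive instance")
```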


SEPIA: A Scalable Game Environment for Artificial Intelligence Teaching and Research

AAAI Conferences

We describe a game environment we have developed that we call the Strategy Engine for Programming Intelligent Agents (SEPIA). SEPIA is based on real-time strategy games, but modified extensively to preferentially support the development of artificial agents rather than human play. Through flexible configuration options, SEPIA is designed to be pedagogically scalable: suitable for use at the undergraduate and graduate levels, and also as a research testbed. We also describe assignments and our experiences with this environment in undergraduate and graduate classes.