Exploring Chemical Space using Natural Language Processing Methodologies for Drug Discovery

Öztürk, Hakime, Özgür, Arzucan, Schwaller, Philippe, Laino, Teodoro, Ozkirimli, Elif

arXiv.org Machine Learning 

Biochemical methods that measure affinity and biophysical methods that describe the interaction in atomistic detail have provided valuable information toward a mechanistic explanation for bimolecular recognition [1]. However, more often than not, compounds with drug potential are discovered serendipitously or by phenotypic drug discovery [2], since this highly specific interaction is still difficult to predict [3]. Protein structure-based computational strategies such as docking [4], ultra-large library docking for discovering new chemotypes [5], and molecular dynamics simulations [4], as well as ligand-based strategies such as quantitative structure-activity relationship (QSAR) modeling [6, 7] and molecular similarity [8], have been effective at narrowing down the list of compounds to be tested experimentally. With the increase in available data, machine learning and deep learning architectures are also starting to play a significant role in cheminformatics and drug discovery [9]. However, these approaches often require extensive computational resources or are limited by the availability of 3D information. In contrast, text-based representations of biochemical entities are far more readily available: PDBbind [10] holds 19,588 biomolecular complexes (3D structures), whereas UniProt [11] holds 561,356 manually annotated and reviewed protein sequences and PubChem [12] holds 97 million compounds (all accessed on Nov 13, 2019). Advances in natural language processing (NLP) methodologies have made the processing of text-based representations of biomolecules an area of intense research interest. NLP comprises a variety of methods that explore large amounts of textual data in order to bring unstructured, latent (or hidden) knowledge to the fore [13]. Advances in this field benefit tasks that use language (textual data) to build insight.
