Shah, Kushal
Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset
Nihalani, Rahul, Shah, Kushal
This paper presents an improved LLM-based model for Grammatical Error Detection (GED), a challenging problem that is important for many downstream applications. The traditional approach to GED relied on hand-designed features, but more recently neural networks (NN) have automated feature discovery and improved performance. Traditional rule-based systems achieve F1 scores of 0.50-0.60, earlier machine learning models such as decision trees and simple neural networks achieve 0.65-0.75, and previous deep learning models such as Bi-LSTMs have reported F1 scores in the range of 0.80 to 0.90. In our study, we fine-tuned various transformer models on a rigorously cleaned version of the Lang-8 dataset. The BERT-base-uncased model performed best, with an F1 score of 0.91 and an accuracy of 98.49% on training data and 90.53% on test data, underscoring the importance of data cleaning. Increasing model size with BERT-large-uncased or RoBERTa-large gave no noticeable improvement on this task, showing that larger models are not always better. Our results demonstrate how far rigorous data cleaning and simple transformer-based models can go toward significantly improving the quality of GED.
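As a minimal sketch of the fine-tuning setup described above (assuming sentence-level binary labels derived from the cleaned Lang-8 data; the toy sentences and hyperparameters are illustrative, not the paper's exact configuration):

    import torch
    from torch.optim import AdamW
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Binary sentence classification: 0 = grammatical, 1 = contains an error.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Toy stand-ins for cleaned Lang-8 sentence/label pairs.
    sentences = ["She go to school every day.", "He went to school yesterday."]
    labels = torch.tensor([1, 0])

    inputs = tokenizer(sentences, padding=True, truncation=True,
                       max_length=128, return_tensors="pt")

    optimizer = AdamW(model.parameters(), lr=2e-5)
    model.train()
    outputs = model(**inputs, labels=labels)  # cross-entropy over [CLS] logits
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

At inference time, an argmax over the two logits flags a sentence as grammatical or erroneous; swapping in BERT-large-uncased or RoBERTa-large requires only changing the checkpoint name.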
BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
John, Peter St., Lin, Dejun, Binder, Polina, Greaves, Malcolm, Shah, Vega, John, John St., Lange, Adrian, Hsu, Patrick, Illango, Rajesh, Ramanathan, Arvind, Anandkumar, Anima, Brookes, David H, Busia, Akosua, Mahajan, Abhishaike, Malina, Stephen, Prasad, Neha, Sinai, Sam, Edwards, Lindsay, Gaudelet, Thomas, Regep, Cristian, Steinegger, Martin, Rost, Burkhard, Brace, Alexander, Hippe, Kyle, Naef, Luca, Kamata, Keisuke, Armstrong, George, Boyd, Kevin, Cao, Zhonglin, Chou, Han-Yi, Chu, Simon, Costa, Allan dos Santos, Darabi, Sajad, Dawson, Eric, Didi, Kieran, Fu, Cong, Geiger, Mario, Gill, Michelle, Hsu, Darren, Kaushik, Gagan, Korshunova, Maria, Kothen-Hill, Steven, Lee, Youhan, Liu, Meng, Livne, Micha, McClure, Zachary, Mitchell, Jonathan, Moradzadeh, Alireza, Mosafi, Ohad, Nashed, Youssef, Paliwal, Saee, Peng, Yuxing, Rabhi, Sara, Ramezanghorbani, Farhad, Reidenbach, Danny, Ricketts, Camir, Roland, Brian, Shah, Kushal, Shimko, Tyler, Sirelkhatim, Hassan, Srinivasan, Savitha, Stern, Abraham C, Toczydlowska, Dorota, Veccham, Srimukh Prasad, Venanzi, Niccolò Alberto Elia, Vorontsov, Anton, Wilber, Jared, Wilkinson, Isabel, Wong, Wei Jing, Xue, Eva, Ye, Cory, Yu, Xin, Zhang, Yang, Zhou, Guoqing, Zandstein, Becca, Dallago, Christian, Trentini, Bruno, Kucukbenli, Emine, Rvachov, Timur, Calleja, Eddie, Israeli, Johnny, Clifford, Harry, Haukioja, Risto, Haemel, Nicholas, Tretina, Kyle, Tadimeti, Neha, Costa, Anthony B
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLMs) trained on hundreds of graphics processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, the BioNeMo Framework trains a three-billion-parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.
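The BioNeMo APIs themselves are not reproduced here; purely as a generic PyTorch illustration of the BERT-style masked-token objective that such pLM pre-training scales up (alphabet size, masking rate, and model dimensions are toy values, not BioNeMo defaults):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, mask_id = 33, 32              # e.g. amino-acid alphabet plus specials
    tokens = torch.randint(1, 26, (8, 512))   # toy batch of tokenized protein sequences

    # Mask 15% of positions; the model must reconstruct the original residues.
    mask = torch.rand(tokens.shape) < 0.15
    inputs = tokens.masked_fill(mask, mask_id)
    targets = tokens.masked_fill(~mask, -100)  # -100 is ignored by the loss

    embed = nn.Embedding(vocab_size, 256)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
        num_layers=4,
    )
    head = nn.Linear(256, vocab_size)

    logits = head(encoder(embed(inputs)))
    loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1),
                           ignore_index=-100)
    loss.backward()

Scaling this objective to billions of parameters and trillions of tokens is where multi-GPU orchestration of the kind the framework provides becomes necessary.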
Formal Ontology Learning on Factual IS-A Corpus in English using Description Logics
Dasgupta, Sourish, Padia, Ankur, Shah, Kushal, Majumder, Prasenjit
Ontology Learning (OL) is the computational task of generating a knowledge base in the form of an ontology from an unstructured corpus whose content is in natural language (NL). Most works in this area are limited to statistical and lexico-syntactic pattern-matching techniques, known as light-weight OL. These techniques do not lead to very accurate learning, mostly because of the many linguistic nuances of NL. Formal OL is an alternative (and less explored) methodology in which deep linguistic analysis is performed, using theory and tools from computational linguistics, to generate formal axioms and definitions instead of simply inducing a taxonomy. In this paper we propose a Description Logic (DL) based formal OL framework for learning factual IS-A sentences in English. We claim that the semantic construction of IS-A sentences is non-trivial, and hence that such sentences require special study in the context of OL before any truly formal OL can be proposed. We introduce a learner tool, called DLOL_IS-A, that generates such ontologies in the OWL format. We adopted Gold-Standard-based OL evaluation on the IS-A-rich WCL v.1.1 dataset and our own community-representative IS-A dataset, and observed significant improvement of DLOL_IS-A over the light-weight OL tool Text2Onto and the formal OL tool FRED.
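To make the contrast with taxonomy induction concrete, here is an illustrative DL translation of two factual IS-A sentences (a sketch of the general approach, not the exact output of DLOL_IS-A):

    "A dolphin is a mammal."
        $\mathit{Dolphin} \sqsubseteq \mathit{Mammal}$

    "A tiger is a large carnivorous animal."
        $\mathit{Tiger} \sqsubseteq \mathit{Animal} \sqcap \mathit{Large} \sqcap \mathit{Carnivorous}$

A light-weight learner would at most place Tiger below Animal in a taxonomy; the DL form additionally captures the modifiers as concept conjuncts, which is precisely where the semantic construction of IS-A sentences becomes non-trivial.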
Description Logics based Formalization of Wh-Queries
Dasgupta, Sourish, KaPatel, Rupali, Padia, Ankur, Shah, Kushal
The problem of Natural Language Query Formalization (NLQF) is to translate a given user query in natural language (NL) into a formal language so that the semantic interpretation is equivalent to the NL interpretation. Formalization of NL queries enables logic-based reasoning during information retrieval, database querying, question answering, etc. It also helps in Web query normalization and indexing, query intent analysis, and related tasks. In this paper we propose a Description Logics based formal methodology for wh-query intent (also called desire) identification and the corresponding formal translation. We evaluated the scalability of our proposed formalism using the Microsoft Encarta 98 query dataset and the OWL-S TC v.4.0 dataset.
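As an illustrative rendering of the desire-identification step (the example query, role name, and operator usage are hypothetical, not the paper's specific formalism): for the wh-query "Which rivers flow through Egypt?", the desire is the concept River, and the query can be translated into a DL concept whose instances are the answers:

    $\mathit{Desire} \equiv \mathit{River} \sqcap \exists \mathit{flowsThrough}.\{\mathit{Egypt}\}$

Answering the query then reduces to instance retrieval for this concept over a knowledge base.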
DLOLIS-A: Description Logic based Text Ontology Learning
Dasgupta, Sourish, Padia, Ankur, Shah, Kushal, KaPatel, Rupali, Majumder, Prasenjit
Ontology Learning has been the subject of intensive study for the past decade. Researchers in this field have been motivated by the possibility of automatically building a knowledge base on top of text documents so as to support reasoning-based knowledge extraction. While most works in this field have been primarily statistical (known as light-weight Ontology Learning), not much attempt has been made at axiomatic Ontology Learning (called heavy-weight Ontology Learning) from natural-language text documents. Heavy-weight Ontology Learning supports more precise, formal, logic-based reasoning than statistical ontology learning. In this paper we propose a sound Ontology Learning tool, DLOL_(IS-A), that maps English-language IS-A sentences into their equivalent Description Logic (DL) expressions in order to automatically generate a consistent pair of T-box and A-box, thereby forming both a regular (definitional form) and a generalized (axiomatic form) DL ontology. The current scope of the paper is strictly limited to IS-A sentences, excluding: (i) implicative IS-A sentences, and (ii) "Wh" IS-A questions. Other linguistic nuances that arise out of the pragmatics and epistemics of IS-A sentences are beyond the scope of this work. We adopted Gold-Standard-based Ontology Learning evaluation on selected IS-A-rich Wikipedia documents.
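As a sketch of the T-box/A-box split described above (illustrative axioms, not DLOL_(IS-A)'s exact output): a class-level IS-A sentence such as "A philosopher is a person who studies wisdom" yields a T-box statement, definitional if the sentence is read as a definition,

    $\mathit{Philosopher} \equiv \mathit{Person} \sqcap \exists \mathit{studies}.\mathit{Wisdom}$

or generalized (axiomatic) if read as mere subsumption,

    $\mathit{Philosopher} \sqsubseteq \mathit{Person} \sqcap \exists \mathit{studies}.\mathit{Wisdom}$

while an instance-level IS-A sentence such as "Socrates is a philosopher" yields the A-box assertion $\mathit{Philosopher}(\mathit{socrates})$.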