Ramis, Cesar Berrospi
Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
Livathinos, Nikolaos, Auer, Christoph, Lysak, Maksym, Nassar, Ahmed, Dolfi, Michele, Vagenas, Panos, Ramis, Cesar Berrospi, Omenetti, Matteo, Dinkla, Kasper, Kim, Yusik, Gupta, Shubham, de Lima, Rafael Teixeira, Weber, Valery, Morin, Lucas, Meijer, Ingmar, Kuropiatnyk, Viktor, Staar, Peter W. J.
We introduce Docling, an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion that can parse several popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware within a small resource budget. Docling is released as a Python package and can be used as a Python API or as a CLI tool. Its modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. Docling has already been integrated into other popular open-source frameworks (e.g., LangChain, LlamaIndex, spaCy), making it a natural fit for document processing and the development of high-end applications. The open-source community has fully engaged in using, promoting, and developing for Docling, which gained 10k stars on GitHub in less than a month and was reported as the No. 1 trending repository on GitHub worldwide in November 2024.
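For illustration, a minimal sketch of converting a document with Docling's published Python API (the usage pattern documented in the project README; the input path is a placeholder):

```python
# Minimal sketch of converting a PDF with Docling's Python API.
from docling.document_converter import DocumentConverter

source = "path/to/report.pdf"  # placeholder: local path or URL to a supported document
converter = DocumentConverter()
result = converter.convert(source)

# The unified document representation can be exported to Markdown
# (among other formats) for downstream use, e.g., in RAG pipelines.
print(result.document.export_to_markdown())
```

The same conversion is available from the command line, e.g. `docling path/to/report.pdf`.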
Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Mishra, Lokesh, Dhibi, Sohayl, Kim, Yusik, Ramis, Cesar Berrospi, Gupta, Shubham, Dolfi, Michele, Staar, Peter
Environmental, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in table structure as well as content. We propose Statements, a novel domain-agnostic data structure for extracting quantitative facts and related information, and frame the translation of tables to statements as a new supervised deep-learning universal information-extraction task. We introduce SemTabNet, a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements that are 82% similar to the ground truth (compared to a baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports; the homogeneous nature of statements permits exploratory data analysis over the expansive information found in large collections of ESG reports.
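To make the idea concrete, here is a purely illustrative sketch of flattening a heterogeneous ESG table row into homogeneous quantitative-fact records; the field names below are this example's own invention, not the statement schema defined in the paper:

```python
# Illustrative only: one heterogeneous ESG table row, as extracted from a report.
table_row = {"KPI": "Scope 1 GHG emissions", "2022": "1,250", "2021": "1,410"}

# The same facts recast as homogeneous records, one quantitative fact each.
# In a real extraction, the unit would come from the table caption/context.
statements = [
    {"property": "Scope 1 GHG emissions", "period": "2022", "value": 1250, "unit": "tCO2e"},
    {"property": "Scope 1 GHG emissions", "period": "2021", "value": 1410, "unit": "tCO2e"},
]
```

Because every record has the same shape regardless of the source table's layout, large collections of such records can be queried and analyzed uniformly.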
INDUS: Effective and Efficient Language Models for Scientific Applications
Bhattacharjee, Bishwaranjan, Trivedi, Aashka, Muraoka, Masayasu, Ramasubramanian, Muthukumaran, Udagawa, Takuma, Gurung, Iksha, Zhang, Rong, Dandala, Bharath, Ramachandran, Rahul, Maskey, Manil, Bugbee, Kaylin, Little, Mike, Fancher, Elizabeth, Sanders, Lauren, Costes, Sylvain, Blanco-Cuaresma, Sergi, Lockhart, Kelly, Allen, Thomas, Grezes, Felix, Ansdell, Megan, Accomazzi, Alberto, El-Kurdi, Yousef, Wertheimer, Davis, Pfitzmann, Birgit, Ramis, Cesar Berrospi, Dolfi, Michele, de Lima, Rafael Teixeira, Vagenas, Panagiotis, Mukkavilli, S. Karthik, Staar, Peter, Vahidinia, Sanaz, McGranaghan, Ryan, Mehrabian, Armin, Lee, Tsendgar
Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics domains, trained on curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained with a domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained on a diverse set of datasets drawn from multiple sources to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation techniques to address applications with latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity recognition), NASA-QA (extractive QA), and NASA-IR (information retrieval), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as on existing benchmark tasks in the domains of interest.
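A sketch of how such a domain-adapted encoder would typically be used for feature extraction via the standard Hugging Face transformers API; the model identifier below is a placeholder, to be replaced with the actual released INDUS checkpoint:

```python
# Sketch: mean-pooled sentence embeddings from a domain-adapted encoder.
# The model ID is a placeholder, not the real INDUS checkpoint name.
from transformers import AutoModel, AutoTokenizer

model_id = "path/or/hub-id-of-indus-encoder"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Sea surface temperature anomalies in the Pacific.",
                   return_tensors="pt")
# Mean-pool token representations into a single sentence vector,
# suitable for retrieval or downstream classification.
embeddings = model(**inputs).last_hidden_state.mean(dim=1)
```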
Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery
Lam, Hoang Thanh, Sbodio, Marco Luca, Galindo, Marcos Martínez, Zayats, Mykhaylo, Fernández-Díaz, Raúl, Valls, Víctor, Picco, Gabriele, Ramis, Cesar Berrospi, López, Vanessa
Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES strings and protein sequences. While these representations have significantly improved predictions, they are usually based on a limited set of modalities and do not exploit available knowledge about existing relations among molecules and proteins. In this study, we demonstrate that by incorporating knowledge graphs from diverse sources and modalities into the sequence or SMILES representation, we can further enrich the representation and achieve state-of-the-art results for drug-target binding affinity prediction on the established Therapeutic Data Commons (TDC) benchmarks. We release a set of multimodal knowledge graphs, integrating data from seven public data sources and containing over 30 million triples. Our intention is to foster additional research exploring how multimodal knowledge-enhanced protein/molecule embeddings can improve prediction tasks, including binding affinity prediction. We also release pretrained models learned from our multimodal knowledge graphs, along with source code for running standard benchmark tasks for binding affinity prediction.
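As context, a sketch of loading one of the TDC drug-target binding affinity benchmarks referenced in the abstract, using the public PyTDC package (`pip install PyTDC`); the enriched embeddings themselves come from the released Otter-Knowledge models, not from this snippet:

```python
# Sketch: load a drug-target interaction benchmark from the
# Therapeutic Data Commons (TDC) evaluation suite.
from tdc.multi_pred import DTI

data = DTI(name="DAVIS")      # drug SMILES + protein sequences + affinity labels
split = data.get_split()      # standard train/valid/test DataFrames
print(split["train"].head())  # columns include Drug, Target, and Y (affinity)
```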