Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI

Lee, Junhyeong; Yuk, Jong Min; Lee, Chan-Woo

arXiv.org Artificial Intelligence 

The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent advances in natural language processing (NLP) have facilitated automatic extraction of structured data from unstructured scientific literature. Existing approaches fall into two groups, multi-step methods and direct methods; each offers valuable capabilities but has limitations when applied on its own. Here, we propose a novel hybrid text-mining framework that integrates the advantages of both to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and then converts that text into structured form. Beyond the overall data structuring framework, we also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations to highlight target entities. Specifically, our entity marker-based hybrid approach not only consistently outperforms previous entity recognition approaches across three benchmark datasets (MatScholar, SOFC, and SOFC slot NER) but also improves the quality of the final structured data, yielding up to a 58% improvement in entity-level F1 score and up to an 83% improvement in relation-level F1 score compared to the direct approach.
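The entity-marker idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the marker symbols (`@@`, `##`), the label names (`MAT`, `PRO`), and the helper functions are hypothetical and are not taken from the paper. The F1 helper uses exact-match, set-based scoring, which is one common convention and may differ from the paper's evaluation protocol.

```python
import re

# Hypothetical symbol pairs per entity type; the abstract does not specify them.
MARKERS = {"MAT": ("@@", "@@"), "PRO": ("##", "##")}


def mark_entities(text, spans):
    """Wrap each (start, end, label) span of `text` in its symbol pair.

    Spans are assumed to be non-overlapping character offsets.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        open_sym, close_sym = MARKERS[label]
        out.append(text[cursor:start])
        out.append(f"{open_sym}{text[start:end]}{close_sym}")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def extract_entities(marked):
    """Recover (label, surface) pairs from marked text, in reading order."""
    found = []
    for label, (open_sym, close_sym) in MARKERS.items():
        pattern = re.escape(open_sym) + r"(.+?)" + re.escape(close_sym)
        for match in re.finditer(pattern, marked):
            found.append((match.start(), label, match.group(1)))
    return [(label, surface) for _, label, surface in sorted(found)]


def entity_f1(predicted, gold):
    """Exact-match F1 over sets of (label, surface) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "LiFePO4 shows high capacity."
spans = [(0, 7, "MAT"), (14, 27, "PRO")]
marked = mark_entities(text, spans)
# marked == "@@LiFePO4@@ shows ##high capacity##."
entities = extract_entities(marked)
```

In the hybrid setting described above, the marked string plays the role of the intermediate entity-recognized text: a first step would produce it from raw text, and a second step would read the highlighted entities when building the structured record.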