In scientific digital libraries, papers from different research communities are often described by community-dependent keywords even when they share a semantically similar topic. Articles that are not tagged with enough keyword variations are poorly indexed in any information retrieval system, which limits potentially fruitful exchanges between scientific disciplines. In this paper, we introduce a novel, experimentally designed pipeline for multi-label semantic-based tagging developed for open-access metadata digital libraries. The approach starts by learning from a standard scientific categorization and a sample of topic-tagged articles in order to find semantically relevant articles and enrich their metadata accordingly. Our proposed pipeline aims to enable researchers to reach articles from various disciplines that tend to use different terminologies. It allows retrieving semantically relevant articles given only a limited set of known search-term variations. In addition to achieving higher accuracy than an expanded-query method that uses a topic synonym set extracted from a semantic network, our experiments also show higher computational scalability than other comparable techniques. We created a new benchmark extracted from the open-access metadata of a scientific digital library and published it along with the experiment code to allow further research on the topic.
Rushing, John (University of Alabama in Huntsville) | Berendes, Todd (University of Alabama in Huntsville) | Lin, Hong (University of Alabama in Huntsville) | Buntain, Cody (University of Alabama in Huntsville) | Graves, Sara (University of Alabama in Huntsville)
This paper describes the Spyglass tool, which is designed to help analysts explore very large collections of unstructured text documents. Spyglass uses a domain ontology to index documents, and provides retrieval and visualization services based on the ontology and the resulting index. The ontology-based approach allows analysts to share information and helps to ensure consistency of results. The approach is also scalable and lends itself very well to parallel computation. The Spyglass system is described in detail, and indexing and query results on a large set of sample documents are presented.
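The ontology-based indexing idea in the abstract above can be illustrated with a minimal sketch. It is not the Spyglass implementation: the toy ontology, the sample documents, and the function names `build_index` and `search` are all hypothetical, and matching is reduced to simple whitespace tokenization.

```python
from collections import defaultdict

# Hypothetical toy ontology: concept -> surface terms (illustrative only).
ONTOLOGY = {
    "vehicle": {"car", "truck", "convoy"},
    "weather": {"storm", "rain", "flood"},
}

def build_index(docs, ontology):
    """Map each ontology concept to the ids of documents that mention one of its terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        for concept, terms in ontology.items():
            if tokens & terms:  # at least one term of the concept appears
                index[concept].add(doc_id)
    return index

def search(index, concept):
    """Return the sorted ids of documents indexed under a concept."""
    return sorted(index.get(concept, set()))

# Hypothetical sample documents.
DOCS = {
    "d1": "A truck convoy crossed the bridge",
    "d2": "Heavy rain caused a flood downtown",
    "d3": "The storm delayed the truck",
}
INDEX = build_index(DOCS, ONTOLOGY)

print(search(INDEX, "vehicle"))
print(search(INDEX, "weather"))
```

In this sketch each document is processed independently of the others, which is one plausible reason such an approach would parallelize well; queries are phrased as concepts rather than raw keywords, so analysts who share the ontology get consistent results.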
The University of North Texas (UNT) Libraries, in partnership with the University of Illinois at Chicago, were awarded a National Leadership Grant (IMLS:LG-71-17-0202-17) from the Institute of Museum and Library Services (IMLS) to research the efficacy of using machine-learning algorithms to identify and extract content-rich publications contained in web archives. With the increase in institutions that are collecting web-published content into web archives, there has been growing interest in mining these web archives to extract publications or documents that align with existing collections or collection development policies. These identified publications could then be integrated into existing digital library collections, where they would become first-order digital objects instead of content discoverable only by traversing the web archive or through a well-crafted full-text search. This project focuses on the first piece of this workflow: identifying the publications that exist and separating them from content that does not align with existing collections. To operationalize this research, the project focuses on three primary use cases, including: extracting scholarly publications for an institutional repository from a university domain's web archive (unt.edu
Caragea, Cornelia (University of North Texas) | Wu, Jian (Pennsylvania State University) | Gollapalli, Sujatha Das (Institute for Infocomm Research, A*STAR) | Giles, C. Lee (Pennsylvania State University)
Online digital libraries make it easier for researchers to search for scientific information. They have proven to be powerful resources in many data mining, machine learning, and information retrieval applications that require high-quality data. The quality of the data depends heavily on the accuracy of the classifiers that identify the types of documents crawled from the Web (e.g., research papers, slides, books) for appropriate indexing. These classifiers in turn depend on the choice of feature representation. We propose novel features that result in high-accuracy classifiers for document type classification. Experimental results on several datasets show that our classifiers outperform models that are employed in current systems.
We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5-6 years. We show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. We also present AI technologies implemented in table search and algorithm search, which are special search modes in CiteSeerX. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and search engines.