specter
Multi-Facet Blending for Faceted Query-by-Example Retrieval
Do, Heejin, Ryu, Sangwon, Kim, Jonghwi, Lee, Gary Geunbae
With the growing demand to fit fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has gained recent attention. However, prior approaches mainly depend on document-level comparisons using basic indicators like citations due to the lack of facet-level relevance datasets; yet, this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically recomposing the units leads to facet-wise relevance-informed document pairs. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to prove FaBle's efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.
The Video Game Industry Is More Successful Than Ever. Why Are Its Workers Treated Like Garbage?
Video game workers--whatever their job, employer, or status--have clearly had enough. This month alone, the labor movement has made some of its biggest advancements ever in organizing the techies, artists, and creatives who keep the largest, most culturally significant sector of the global entertainment industry running and thriving. First, on July 19, came "wall-to-wall" union approval at Fallout-maker Bethesda Game Studios, which meant that everyone from engineers to artists could establish a comprehensive unit with the Communications Workers of America. They quickly earned recognition from parent company Microsoft, marking the first wall-to-wall effort to succeed at any of the Big Tech firm's gaming studios. On July 24, even more company workers got into the game.
AI on AI: Exploring the Utility of GPT as an Expert Annotator of AI Publications
Toney-Wails, Autumn, Schoeberl, Christian, Dunham, James
Identifying scientific publications that are within a dynamic field of research often requires costly annotation by subject-matter experts. Resources like widely-accepted classification criteria or field taxonomies are unavailable for a domain like artificial intelligence (AI), which spans emerging topics and technologies. We address these challenges by inferring a functional definition of AI research from existing expert labels, and then evaluating state-of-the-art chatbot models on the task of expert data annotation. Using the arXiv publication database as ground-truth, we experiment with prompt engineering for GPT chatbot models to identify an alternative, automated expert annotation pipeline that assigns AI labels with 94% accuracy. For comparison, we fine-tune SPECTER, a transformer language model pre-trained on scientific publications, that achieves 96% accuracy (only 2% higher than GPT) on classifying AI publications. Our results indicate that with effective prompt engineering, chatbots can be used as reliable data annotators even where subject-area expertise is required. To evaluate the utility of chatbot-annotated datasets on downstream classification tasks, we train a new classifier on GPT-labeled data and compare its performance to the arXiv-trained model. The classifier trained on GPT-labeled data outperforms the arXiv-trained model by nine percentage points, achieving 82% accuracy.
PaECTER: Patent-level Representation Learning using Citation-informed Transformers
Ghosh, Mainak, Erhardt, Sebastian, Rose, Michael E., Buunk, Erik, Harhoff, Dietmar
PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset on two different rank evaluation metrics. PaECTER predicts at least one most similar patent at a rank of 1.32 on average when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.
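The rank figure quoted above can be read as "the position of the first truly similar patent when candidates are sorted by similarity to the query". The abstract does not define the metric precisely, so the following is only a minimal sketch under that assumption. The function names and the toy two-dimensional vectors are hypothetical; real inputs would be PaECTER embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_first_relevant(query, candidates):
    """Return the 1-based rank of the first relevant candidate.

    `candidates` is a list of (embedding, is_relevant) tuples. Candidates are
    sorted by descending cosine similarity to `query`. Returns None if no
    candidate is marked relevant.
    """
    ranked = sorted(candidates, key=lambda c: cosine(query, c[0]), reverse=True)
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            return rank
    return None


def mean_first_relevant_rank(queries):
    """Average the first-relevant rank over (query, candidates) pairs."""
    ranks = [rank_of_first_relevant(q, c) for q, c in queries]
    ranks = [r for r in ranks if r is not None]
    return sum(ranks) / len(ranks) if ranks else None
```

Averaging this rank over many queries, each mixed with 25 irrelevant patents, would yield a number comparable in spirit to the 1.32 reported, though the paper's exact protocol may differ.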
SKT5SciSumm -- A Hybrid Generative Approach for Multi-Document Scientific Summarization
To, Huy Quoc, Tran, Hung-Nghiep, Greiner-Petter, André, Beierle, Felix, Aizawa, Akiko
Summarization for scientific text has shown significant benefits both for the research community and human society. Given the fact that the nature of scientific text is distinctive and the input of the multi-document summarization task is substantially long, the task requires sufficient embedding generation and text truncation without losing important information. To tackle these issues, in this paper, we propose SKT5SciSumm - a hybrid framework for multi-document scientific summarization (MDSS). We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent textual sentences, allowing for efficient extractive summarization using k-means clustering. We employ the T5 family of models to generate abstractive summaries using extracted sentences. SKT5SciSumm achieves state-of-the-art performance on the Multi-XScience dataset. Through extensive experiments and evaluation, we showcase the benefits of our model by using less complicated models to achieve remarkable results, thereby highlighting its potential in advancing the field of multi-document summarization for scientific text.
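The extractive step described above (sentence embeddings clustered with k-means) can be sketched as follows. This is an illustrative sketch, not the authors' implementation. The toy vectors stand in for SPECTER sentence embeddings, the deterministic "first k points" initialisation is a simplification, and picking the sentence nearest each centroid is an assumed selection rule that the abstract does not specify.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic initialisation (first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def extract_sentences(sentences, embeddings, k):
    """Pick the sentence closest to each cluster centroid, in document order."""
    centroids, _ = kmeans(embeddings, k)
    picked = {
        min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], c))
        for c in centroids
    }
    return [sentences[i] for i in sorted(picked)]
```

In the pipeline the abstract describes, the selected sentences would then be passed to a T5 model to generate the abstractive summary; that step is not shown here.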
OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement
Gao, Yang, Ma, Ji, Korotkov, Ivan, Hall, Keith, Alon, Dana, Metzler, Don
We develop and evaluate multilingual scientific documents similarity measurement models in this work. Such models can be used to find related works in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we pretrain science-specialized language models, and explore different strategies to derive "related" paper pairs to fine-tune the models, including using a mixture of citation, co-citation, and bibliographic-coupling pairs. To further improve the models' performance for non-English papers, we explore the use of generative language models to enrich the non-English papers with English summaries. This allows us to leverage the models' English capabilities to create better representations for non-English papers. Our best model significantly outperforms strong baselines by 7-16% (in mean average precision).
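The three pair types mentioned above have standard bibliometric definitions: a citation pair links a paper to one it cites, a co-citation pair links two papers cited by the same paper, and a bibliographic-coupling pair links two papers that cite the same paper. The sketch below derives all three from a toy citation graph. The function name and input format are hypothetical, and how OpenMSD mixes or weights the pairs is not stated in the abstract.

```python
from itertools import combinations


def derive_pairs(citations):
    """Derive related-paper pairs from a citation graph.

    `citations` maps each paper id to the set of paper ids it cites.
    Returns three sets of pairs: (direct, co_cited, coupled).
    """
    # Direct citation: (citing, cited).
    direct = {(a, b) for a, refs in citations.items() for b in refs}

    # Co-citation: two papers appearing in the same reference list.
    co_cited = set()
    for refs in citations.values():
        co_cited.update(combinations(sorted(refs), 2))

    # Bibliographic coupling: two papers sharing at least one reference.
    cited_by = {}
    for a, refs in citations.items():
        for b in refs:
            cited_by.setdefault(b, set()).add(a)
    coupled = set()
    for citers in cited_by.values():
        coupled.update(combinations(sorted(citers), 2))

    return direct, co_cited, coupled
```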
Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens
Seoh, Ronald, Chang, Haw-Shiuan, McCallum, Andrew
Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could make a Transformer better specialize to multiple scientific domains. We present Multi2SPE: it encourages each of multiple CLS tokens to learn diverse ways of aggregating token embeddings, then sums them up together to create a single vector representation. We also propose our new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders under multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only a negligible amount of computation in addition to one BERT forward pass.
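The final aggregation step described above, summing several CLS-token outputs into one vector, is simple enough to sketch directly. This covers only the summation. How Multi2SPE encourages the CLS tokens to diverge during training is not shown, and the function name is hypothetical.

```python
def combine_cls(cls_vectors):
    """Sum K CLS-token output vectors element-wise into one document vector."""
    if not cls_vectors:
        raise ValueError("need at least one CLS vector")
    dim = len(cls_vectors[0])
    if any(len(v) != dim for v in cls_vectors):
        raise ValueError("all CLS vectors must have the same dimension")
    return [sum(column) for column in zip(*cls_vectors)]
```

Because the combination is a single element-wise sum, it adds very little work on top of the encoder's forward pass, which is consistent with the abstract's claim of negligible extra computation.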
MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents
Razdaibiedina, Anastasia, Brechalov, Alexander
Learning semantically meaningful representations from scientific documents can facilitate academic literature search and improve the performance of recommendation systems. Pre-trained language models have been shown to learn rich textual representations, yet they cannot provide powerful document-level representations for scientific articles. We propose MIReAD, a simple method that learns high-quality representations of scientific papers by fine-tuning a transformer model to predict the target journal class based on the abstract. We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes. We show that MIReAD produces representations that can be used for similar-paper retrieval, topic categorization, and literature search. Our proposed approach outperforms six existing models for representation learning on scientific documents across four evaluation standards.
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
Ostendorff, Malte, Rethmeier, Nils, Augenstein, Isabelle, Gipp, Bela, Rehm, Georg
Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.
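The sampling idea described above can be sketched as ranking every other paper by distance in citation-graph embedding space, taking the closest ones as positives, skipping a margin band, and taking the next ones as hard negatives. This is a minimal sketch under stated assumptions. The parameter names (`k_pos`, `margin`, `k_neg`), their defaults, and the use of Euclidean distance are illustrative and not taken from the paper.

```python
import math


def sample_contrast(query_id, graph_emb, k_pos=2, margin=1, k_neg=2):
    """Sample positives and hard negatives for one query paper.

    `graph_emb` maps paper ids to citation-graph embedding vectors.
    Neighbours are ranked by distance to the query. The first `k_pos` are
    positives, the next `margin` are skipped so positives and negatives do
    not collide, and the following `k_neg` are hard negatives.
    """
    query = graph_emb[query_id]
    neighbours = sorted(
        (pid for pid in graph_emb if pid != query_id),
        key=lambda pid: math.dist(query, graph_emb[pid]),
    )
    positives = neighbours[:k_pos]
    start = k_pos + margin
    hard_negatives = neighbours[start:start + k_neg]
    return positives, hard_negatives
```

Widening `margin` trades harder negatives for a lower risk of treating a genuinely similar paper as a negative, which is the collision problem the abstract mentions.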
Why the specter of blockchain will be more important to humankind than AI
I'm back with another translation, this time of a lecture that was given in Chinese at Peking University on May 5, 2018 by Wang Feng, founder of Huoxing24 (a Chinese blockchain news site), LineKong (a Chinese media and entertainment product company) and partner at GeekFounders (a tech investment firm). I found his perspective about blockchain a solid and concise consolidation of the leading thoughts about the history of blockchain and what its significance is. The way he explains it illuminates both a global way of thinking about the technology, and also a Chinese perspective of what that thinking means (starting with a straight-up quote from the Communist Manifesto...). Having also worked on AI products myself, it was a fresh take on the significance of AI as compared to blockchain, and why people have strong feelings about the impact of these two disruptive technologies. Note that I'm not translating this as a direct transcription, but rather as a narrative, so I skip all the parts addressing the audience like "hello fellow alumni" and whatnot. I'll pepper my thoughts and comments throughout, so please do think about and debate those with me :) Again, kudos to the originator, and you can credit me only for the translation, dramatic flourishes and contextual comments.