Information Retrieval
The Morning After: Reddit is blocking AI search engines that don't cough up for access
When Reddit said last month it would block unauthorized data scraping from its site, most of us assumed it was to tackle chatbot training. It turns out the site/service/fandom battleground also appears to be blocking search engines other than Brave and Google, the latter of which reportedly inked a deal earlier this year with Reddit worth $60 million annually. A Reddit spokesperson told Engadget the empty search results are because these engines won't agree to the company's requirements for AI training. The company says it's in discussions with several of them. Bing and DuckDuckGo both appear to be affected.
Constructing the CORD-19 Vaccine Dataset
Singh, Manisha, Sharma, Divy, Ma, Alonso, Tyree, Bridget, Mitchell, Margaret
We introduce a new dataset, 'CORD-19-Vaccination', to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns), we processed the JSON file for each paper and then used Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper, and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand-annotated the training dataset and used a pre-trained BERT-PubMed layer. 'CORD-19-Vaccination' contains 30k research papers and can be immensely valuable for NLP research such as text mining, information extraction, and question answering, specific to the domain of COVID-19 vaccine research.
Search engines that don't pay up can't index Reddit content
When Reddit said last month that it would block unauthorized data scraping from its site, everyone's (rightful) first reaction was "AI, AI, AI." However, now that the change has taken effect, chatbot makers aren't the only ones being locked out. The widely used forum also appears to be blocking all search engines other than Google, which reportedly inked a deal earlier this year with Reddit worth $60 million annually. The publication reported that DuckDuckGo produced seven links without any descriptions, only providing the note, "We would like to show you a description here but the site won't allow us." The engine now appears to have removed even those, as our test only produced an empty page, reading, "no results found."
Improving ICD coding using Chapter based Named Entities and Attentional Models
Beeravolu, Abhijith R., Jonkman, Mirjam, Azam, Sami, De Boer, Friso
Recent advancements in natural language processing (NLP) have led to automation in various domains. However, clinical NLP often relies on benchmark datasets that may not reflect real-world scenarios accurately. Automatic ICD coding, a vital NLP task, typically uses outdated and imbalanced datasets like MIMIC-III, with existing methods yielding micro-averaged F1 scores between 0.4 and 0.7 due to many false positives. Our research introduces an enhanced approach to ICD coding that improves F1 scores by using chapter-based named entities and attentional models. This method categorizes discharge summaries into ICD-9 Chapters and develops attentional models with chapter-specific data, eliminating the need to consider external data for code identification. For categorization, we use Chapter-IV to de-bias and influence key entities and weights without neural networks, creating accurate thresholds and providing interpretability for human validation. Post-validation, we develop attentional models for three frequent and three non-frequent codes from Chapter-IV using Bidirectional-Gated Recurrent Units (GRUs) with Attention and Transformer with Multi-head Attention architectures. The average Micro-F1 scores of 0.79 and 0.81 from these models demonstrate significant performance improvements in ICD coding.
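The link between false positives and low micro-averaged F1 can be illustrated with a minimal sketch. The function name `micro_f1` and the code assignments below are made up for illustration and are not taken from the paper; the formula itself is the standard one, pooling true positives, false positives, and false negatives over all documents and codes before computing precision and recall.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for multi-label predictions.

    gold and predicted are lists of sets of codes, one set per document.
    Counts are pooled over all documents before precision and recall
    are computed.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # codes predicted and correct
        fp += len(p - g)   # codes predicted but not in the gold set
        fn += len(g - p)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical code assignments for three discharge summaries.
gold = [{"250.00", "401.9"}, {"272.4"}, {"250.00"}]
exact = [{"250.00", "401.9"}, {"272.4"}, {"250.00"}]
# Same true codes, plus one or two spurious codes per document.
noisy = [
    {"250.00", "401.9", "272.4", "276.8"},
    {"272.4", "250.00"},
    {"250.00", "401.9"},
]

print(micro_f1(gold, exact))
print(micro_f1(gold, noisy))
```

In the noisy case every gold code is still found, so recall stays at 1.0, but four spurious codes against four correct ones halve the precision and pull the pooled score down to two thirds.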
The Morning After: Condé Nast is the latest media company to accuse AI search engine Perplexity of plagiarism
Condé Nast, the media giant that owns The New Yorker, Vogue and Wired, has sent a cease-and-desist letter to AI-powered search startup Perplexity, according to The Information. The letter, sent on Monday, demanded Perplexity stop using content from Condé Nast publications in its AI-generated responses and accused the startup of plagiarism. It comes a month after Forbes took similar action. Condé Nast CEO Roger Lynch has warned "many" media companies could face financial ruin in the time it would take for litigation against generative AI companies to conclude. Lynch has called upon Congress to take "immediate action."
Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data
Schelb, Julian, Ulloa, Roberto, Spitz, Andreas
Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated scalable methods are necessary due to the impracticality of manual labeling. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL & content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier. Fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL & content-based features perform best, while using URLs alone provides adequate results when content is unavailable.
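One way to picture the combination of URL and content-based features is as a single token list in which each token is tagged with its origin. The sketch below is an assumption about how such features could be built, not the authors' pipeline; the function names, the `url:`/`txt:` prefixes, and the example URL and sentence are all hypothetical.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase tokens."""
    parsed = urlparse(url)
    raw = (parsed.netloc + " " + parsed.path).lower()
    return ["url:" + t for t in re.split(r"[^a-z0-9äöüß]+", raw) if t]


def content_tokens(text):
    """Split page text into lowercase word tokens."""
    return ["txt:" + t for t in re.findall(r"[a-z0-9äöüß]+", text.lower())]


def combined_features(url, text=None):
    """URL tokens, plus content tokens when the page text is available."""
    features = url_tokens(url)
    if text:
        features += content_tokens(text)
    return features


# Hypothetical page, used only to show the shape of the output.
url = "https://www.example.de/politik/mindestlohn-2022"
print(combined_features(url, "Der Mindestlohn steigt."))
print(combined_features(url))  # URL-only fallback when content is missing
```

Keeping the two sources in separate prefixed namespaces lets a downstream classifier weight a term in the URL differently from the same term in the page body, and the URL-only call mirrors the case where content is unavailable.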
Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval
Lee, Daeun, Choi, Jaewoong, Mizuseki, Hiroshi, Lee, Byungju
Recent studies have increasingly applied natural language processing (NLP) to automatically extract experimental research data from the extensive battery materials literature. Despite the complex process involved in battery manufacturing -- from material synthesis to cell assembly -- there has been no comprehensive study systematically organizing this information. In response, we propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for the automatic extraction of end-to-end battery recipes, validated using a case study on batteries containing LiFePO4 cathode material. We report machine learning-based paper filtering models, screening 2,174 relevant papers from the keyword-based search results, and unsupervised topic models to identify 2,876 paragraphs related to cathode synthesis and 2,958 paragraphs related to cell assembly. Then, focusing on the two topics, two deep learning-based named entity recognition models are developed to extract a total of 30 entities -- including precursors, active materials, and synthesis methods -- achieving F1 scores of 88.18% and 94.61%. The accurate extraction of entities enables the systematic generation of 165 end-to-end recipes of LiFePO4 batteries. Our protocol and results offer valuable insights into specific trends, such as associations between precursor materials and synthesis methods, or combinations between different precursor materials. We anticipate that our findings will serve as a foundational knowledge base for facilitating battery-recipe information retrieval. The proposed protocol will significantly accelerate the review of battery material literature and catalyze innovations in battery design and development.
'Google says I'm a dead physicist': is the world's biggest search engine broken?
I didn't know I was dead until I saw it on Google. When I searched my name, there it was: a picture of my smiling face next to the text "Tom Faber was a physicist and publisher, and he was a university lecturer at Cambridge for 35 years". Apparently I died on 27 July 2004, aged 77. This was news to me. The problem was the picture. When you search the name of a notable person, Google may create what it calls a "knowledge panel", a little box with basic information taken from Wikipedia. Somewhere along the way, the algorithm had confused pictures of my face with the biography of another man who shared my name. According to his obituary, he was "a distinguished physicist with a literary hinterland". Google provides a feedback form to resolve this type of bug. I filled it in several times, but it made no difference.
Improving Retrieval in Sponsored Search by Leveraging Query Context Signals
Mohankumar, Akash Kumar, K, Gururaj, Madan, Gagan, Singh, Amit
Accurately retrieving relevant bid keywords for user queries is critical in Sponsored Search but remains challenging, particularly for short, ambiguous queries. Existing dense and generative retrieval models often fail to capture nuanced user intent in these cases. To address this, we propose an approach to enhance query understanding by augmenting queries with rich contextual signals derived from web search results and large language models, stored in an online cache. Specifically, we use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations that clarify user intent. These signals are efficiently integrated through a Fusion-in-Decoder based Unity architecture, enabling both dense and generative retrieval with serving costs on par with traditional context-free models. To address scenarios where context is unavailable in the cache, we introduce context glancing, a curriculum learning strategy that improves model robustness and performance even without contextual signals during inference. Extensive offline experiments demonstrate that our context-aware approach substantially outperforms context-free models. Furthermore, online A/B testing on a prominent search engine across 160+ countries shows significant improvements in user engagement and revenue.
An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry
Meisenbacher, Stephen, Schopf, Tim, Yan, Weixin, Holl, Patrick, Matthes, Florian
The task of $\textit{keyword extraction}$ is often an important initial step in unsupervised information extraction, forming the basis for tasks such as topic modeling or document classification. While recent methods have proven to be quite effective in the extraction of keywords, the identification of $\textit{class-specific}$ keywords, or only those pertaining to a predefined class, remains challenging. In this work, we propose an improved method for class-specific keyword extraction, which builds upon the popular $\textbf{KeyBERT}$ library to identify only keywords related to a class described by $\textit{seed keywords}$. We test this method using a dataset of German business registry entries, where the goal is to classify each business according to an economic sector. Our results reveal that our method greatly improves upon previous approaches, setting a new standard for $\textit{class-specific}$ keyword extraction.
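The general idea of keeping only keywords close to a set of seed keywords can be sketched without any external library. The example below is not KeyBERT and not the authors' method: it swaps sentence embeddings for character-trigram count vectors so that it runs on the standard library alone, and the function names, threshold, and candidate words are all illustrative assumptions.

```python
import math
from collections import Counter


def trigram_vector(word):
    """Character-trigram counts, a crude stand-in for a learned embedding."""
    padded = f"  {word.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def class_specific_keywords(candidates, seed_keywords, threshold=0.3):
    """Keep candidates whose best similarity to any seed meets the threshold."""
    seeds = [trigram_vector(s) for s in seed_keywords]
    if not seeds:
        return []
    scored = []
    for candidate in candidates:
        vector = trigram_vector(candidate)
        score = max(cosine(vector, seed) for seed in seeds)
        if score >= threshold:
            scored.append((candidate, score))
    # Highest-scoring keywords first.
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical candidates from a registry entry, with one seed for the class.
candidates = ["Softwareentwicklung", "Bäckerei", "Software", "Immobilien"]
for keyword, score in class_specific_keywords(candidates, ["software"]):
    print(f"{keyword}: {score:.2f}")
```

Trigram overlap only captures surface similarity, so it would miss synonyms that an embedding model could match; the sketch is meant to show the filter-by-seed-similarity step, not to approximate the quality of an embedding-based approach.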