AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

The Morning After: Reddit is blocking AI search engines that don't cough up for access

Engadget, Jul-25-2024, 11:15:37 GMT

When Reddit said last month it would block unauthorized data scraping from its site, most of us assumed it was to tackle chatbot training. It turns out the site/service/fandom battleground also appears to be blocking search engines other than Brave and Google, the latter of which reportedly inked a deal with Reddit earlier this year worth $60 million annually. A Reddit spokesperson told Engadget the empty search results are because these engines won't agree to the company's requirements for AI training. The company says it's in discussions with several of them. Bing and DuckDuckGo both appear to be affected.

ai search engine, reddit, search engine, (5 more...)

Engadget

Country: North America > United States > Arizona (0.07)

Industry:

Media > News (1.00)
Leisure & Entertainment > Games > Computer Games (0.35)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.63)

Constructing the CORD-19 Vaccine Dataset

Singh, Manisha, Sharma, Divy, Ma, Alonso, Tyree, Bridget, Mitchell, Margaret

arXiv.org Artificial Intelligence, Jul-25-2024

We introduce a new dataset, 'CORD-19-Vaccination', to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns), we processed the JSON file for each paper and then enhanced the result using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper, and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand-annotated the training dataset and used a pre-trained BERT-PubMed layer. 'CORD-19-Vaccination' contains 30k research papers and can be immensely valuable for NLP research such as text mining, information extraction, and question answering, specific to the domain of COVID-19 vaccine research.

cord-19 dataset, cord-19-vaccination, dataset, (12 more...)

arXiv.org Artificial Intelligence

2407.18471

Country:

South America > Brazil (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Maryland (0.04)
(6 more...)

Genre: Research Report (0.84)

Industry: Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)

Search engines that don't pay up can't index Reddit content

Engadget, Jul-24-2024, 17:29:49 GMT

When Reddit said last month that it would block unauthorized data scraping from its site, everyone's (rightful) first reaction was "AI, AI, AI." However, now that the change has taken effect, chatbot makers aren't the only ones being locked out. The widely used forum also appears to be blocking all search engines other than Google, which reportedly inked a deal with Reddit earlier this year worth $60 million annually. The publication reported that DuckDuckGo produced seven links without any descriptions, only providing the note, "We would like to show you a description here but the site won't allow us." The engine now appears to have removed even those, as our test only produced an empty page, reading, "no results found."

engine, index reddit content, reddit, (8 more...)

Engadget

Industry: Media > News (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.66)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.55)

Improving ICD coding using Chapter based Named Entities and Attentional Models

Beeravolu, Abhijith R., Jonkman, Mirjam, Azam, Sami, De Boer, Friso

arXiv.org Artificial Intelligence, Jul-24-2024

Recent advancements in natural language processing (NLP) have led to automation in various domains. However, clinical NLP often relies on benchmark datasets that may not reflect real-world scenarios accurately. Automatic ICD coding, a vital NLP task, typically uses outdated and imbalanced datasets like MIMIC-III, with existing methods yielding micro-averaged F1 scores between 0.4 and 0.7 due to many false positives. Our research introduces an enhanced approach to ICD coding that improves F1 scores by using chapter-based named entities and attentional models. This method categorizes discharge summaries into ICD-9 Chapters and develops attentional models with chapter-specific data, eliminating the need to consider external data for code identification. For categorization, we use Chapter-IV to de-bias and influence key entities and weights without neural networks, creating accurate thresholds and providing interpretability for human validation. Post-validation, we develop attentional models for three frequent and three non-frequent codes from Chapter-IV using Bidirectional-Gated Recurrent Units (GRUs) with Attention and Transformer with Multi-head Attention architectures. The average Micro-F1 scores of 0.79 and 0.81 from these models demonstrate significant performance improvements in ICD coding.
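To make the metric in this abstract concrete, here is a minimal sketch of micro-averaged F1 for multi-label code assignment. It is not the authors' evaluation code: the function name `micro_f1` and the placeholder labels `code_a` to `code_d` are assumptions made for illustration. The toy data only shows how extra false positives pull the score down even when every true code is found.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over multi-label predictions.

    `gold` and `predicted` are parallel lists of sets of codes,
    one set per document. Counts are pooled across all documents
    before precision and recall are computed.
    """
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        tp += len(gold_codes & pred_codes)   # codes correctly assigned
        fp += len(pred_codes - gold_codes)   # codes assigned but not in gold
        fn += len(gold_codes - pred_codes)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the labels are placeholders, not real ICD codes.
gold = [{"code_a", "code_b"}, {"code_c"}, {"code_a"}]
exact = [{"code_a", "code_b"}, {"code_c"}, {"code_a"}]
noisy = [{"code_a", "code_b", "code_c"}, {"code_c", "code_b"}, {"code_a", "code_d"}]

print(micro_f1(gold, exact))
print(micro_f1(gold, noisy))
```

In the noisy case all four gold codes are recovered, but three spurious codes reduce precision to 4/7, so the micro-F1 falls to 8/11.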

discharge summary, icd code, mimic, (15 more...)

arXiv.org Artificial Intelligence

2407.1723

Genre: Research Report (0.82)

Industry: Health & Medicine > Health Care Providers & Services (0.98)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.89)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.76)

The Morning After: Condé Nast is the latest media company to accuse AI search engine Perplexity of plagiarism

Engadget, Jul-23-2024, 11:15:59 GMT

Condé Nast, the media giant that owns The New Yorker, Vogue and Wired, has sent a cease-and-desist letter to AI-powered search startup Perplexity, according to The Information. The letter, sent on Monday, demanded Perplexity stop using content from Condé Nast publications in its AI-generated responses and accused the startup of plagiarism. It comes a month after Forbes took similar action. Condé Nast CEO Roger Lynch has warned "many" media companies could face financial ruin in the time it would take for litigation against generative AI companies to conclude. Lynch has called upon Congress to take "immediate action."

ai search engine perplexity, latest media company, plagiarism, (11 more...)

Engadget

Country: North America > United States > New York (0.26)

Industry:

Media (1.00)
Law (0.97)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data

Schelb, Julian, Ulloa, Roberto, Spitz, Andreas

arXiv.org Artificial Intelligence, Jul-23-2024

Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated scalable methods are necessary due to the impracticality of manual labeling. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL & content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier. Fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL & content-based features perform best, while using URLs alone provides adequate results when content is unavailable.
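One way to picture the combination of URL and content-based features is the sketch below. It is an illustration under assumptions, not the paper's pipeline: the helper names `url_tokens` and `build_features`, the `url:` and `txt:` prefixes, and the example URL are all hypothetical. It only shows how a single feature list can be built from both sources, and how it degrades to URL-only features when page content is unavailable.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).lower()
    return [token for token in re.split(r"[^a-z0-9]+", raw) if token]


def build_features(url, content=None):
    """Return prefixed tokens from the URL and, if present, the page text.

    The prefixes keep URL tokens and text tokens apart, so a downstream
    classifier can weight the two sources differently.
    """
    features = [f"url:{token}" for token in url_tokens(url)]
    if content:
        features += [f"txt:{token}" for token in re.findall(r"\w+", content.lower())]
    return features


# Hypothetical example page.
print(build_features("https://www.example.de/politik/energie-wende"))
print(build_features("https://www.example.de/politik", "Neue Regeln"))
```

The resulting token lists could be passed to any bag-of-words classifier; fine-tuned encoder models, as compared in the abstract, would instead consume the raw URL and text.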

classification, classifier, webpage, (15 more...)

arXiv.org Artificial Intelligence

2407.16516

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(15 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Law (0.68)
Information Technology (0.67)
Energy > Renewable (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(5 more...)

Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval

Lee, Daeun, Choi, Jaewoong, Mizuseki, Hiroshi, Lee, Byungju

arXiv.org Artificial Intelligence, Jul-22-2024

Recent studies have increasingly applied natural language processing (NLP) to automatically extract experimental research data from the extensive battery materials literature. Despite the complex process involved in battery manufacturing -- from material synthesis to cell assembly -- there has been no comprehensive study systematically organizing this information. In response, we propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for the automatic extraction of end-to-end battery recipes, validated using a case study on batteries containing LiFePO4 cathode material. We report machine learning-based paper filtering models, screening 2,174 relevant papers from the keyword-based search results, and unsupervised topic models to identify 2,876 paragraphs related to cathode synthesis and 2,958 paragraphs related to cell assembly. Then, focusing on the two topics, two deep learning-based named entity recognition models are developed to extract a total of 30 entities -- including precursors, active materials, and synthesis methods -- achieving F1 scores of 88.18% and 94.61%. The accurate extraction of entities enables the systematic generation of 165 end-to-end recipes of LiFePO4 batteries. Our protocol and results offer valuable insights into specific trends, such as associations between precursor materials and synthesis methods, or combinations between different precursor materials. We anticipate that our findings will serve as a foundational knowledge base for facilitating battery-recipe information retrieval. The proposed protocol will significantly accelerate the review of battery material literature and catalyze innovations in battery design and development.

cathode material synthesis, information, material synthesis, (13 more...)

arXiv.org Artificial Intelligence

2407.15459

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.86)

Industry:

Energy > Energy Storage (1.00)
Electrical Industrial Apparatus (1.00)
Materials > Chemicals > Commodity Chemicals > Petrochemicals (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

'Google says I'm a dead physicist': is the world's biggest search engine broken?

The Guardian, Jul-20-2024, 11:00:35 GMT

I didn't know I was dead until I saw it on Google. When I searched my name, there it was: a picture of my smiling face next to the text "Tom Faber was a physicist and publisher, and he was a university lecturer at Cambridge for 35 years". Apparently I died on 27 July 2004, aged 77. This was news to me. The problem was the picture. When you search the name of a notable person, Google may create what it calls a "knowledge panel", a little box with basic information taken from Wikipedia. Somewhere along the way, the algorithm had confused pictures of my face with the biography of another man who shared my name. According to his obituary, he was "a distinguished physicist with a literary hinterland". Google provides a feedback form to resolve this type of bug. I filled it in several times, but it made no difference.

google, information, search engine, (16 more...)

The Guardian

Country:

North America > United States (1.00)
Asia > India > Karnataka (0.04)

Genre: Personal (0.48)

Industry:

Law (1.00)
Information Technology > Services (1.00)
Government > Regional Government > North America Government > United States Government (0.94)
Education (0.86)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.45)

Improving Retrieval in Sponsored Search by Leveraging Query Context Signals

Mohankumar, Akash Kumar, K, Gururaj, Madan, Gagan, Singh, Amit

arXiv.org Artificial Intelligence, Jul-19-2024

Accurately retrieving relevant bid keywords for user queries is critical in Sponsored Search but remains challenging, particularly for short, ambiguous queries. Existing dense and generative retrieval models often fail to capture nuanced user intent in these cases. To address this, we propose an approach to enhance query understanding by augmenting queries with rich contextual signals derived from web search results and large language models, stored in an online cache. Specifically, we use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations that clarify user intent. These signals are efficiently integrated through a Fusion-in-Decoder based Unity architecture, enabling both dense and generative retrieval with serving costs on par with traditional context-free models. To address scenarios where context is unavailable in the cache, we introduce context glancing, a curriculum learning strategy that improves model robustness and performance even without contextual signals during inference. Extensive offline experiments demonstrate that our context-aware approach substantially outperforms context-free models. Furthermore, online A/B testing on a prominent search engine across 160+ countries shows significant improvements in user engagement and revenue.

augmented unity, query, retrieval, (15 more...)

arXiv.org Artificial Intelligence

2407.14346

Country: Asia > India (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.52)

An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry

Meisenbacher, Stephen, Schopf, Tim, Yan, Weixin, Holl, Patrick, Matthes, Florian

arXiv.org Artificial Intelligence, Jul-19-2024

The task of keyword extraction is often an important initial step in unsupervised information extraction, forming the basis for tasks such as topic modeling or document classification. While recent methods have proven to be quite effective in the extraction of keywords, the identification of class-specific keywords, or only those pertaining to a predefined class, remains challenging. In this work, we propose an improved method for class-specific keyword extraction, which builds upon the popular KeyBERT library to identify only keywords related to a class described by seed keywords. We test this method using a dataset of German business registry entries, where the goal is to classify each business according to an economic sector. Our results reveal that our method greatly improves upon previous approaches, setting a new standard for class-specific keyword extraction.
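The idea of steering extraction with seed keywords can be sketched as follows. This is a toy stand-in, not the paper's method and not the KeyBERT API: it replaces sentence embeddings with character-trigram count vectors so that it runs on the standard library alone, and the function names and example text are hypothetical. Candidate words are ranked by their highest cosine similarity to any seed keyword.

```python
import math
import re
from collections import Counter


def trigram_vector(text):
    """Count character trigrams of a padded, lowercased string."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_specific_keywords(document, seed_keywords, top_n=3):
    """Rank words of `document` by their best similarity to any seed keyword."""
    candidates = set(re.findall(r"[a-zäöüß]{4,}", document.lower()))
    seed_vectors = [trigram_vector(seed) for seed in seed_keywords]
    scored = [
        (max(cosine(trigram_vector(word), seed) for seed in seed_vectors), word)
        for word in candidates
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top_n] if score > 0]


# Hypothetical registry-style entry and seed keywords for a "bakery" class.
entry = "The bakery sells bread and pastries, and repairs bicycles on weekends."
print(class_specific_keywords(entry, ["bread", "baking"], top_n=2))
```

Character trigrams capture only surface overlap, so a semantically related word such as "pastries" scores zero here; an embedding model is what lets the approach described in the abstract relate words by meaning rather than spelling.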

extraction, keyword, seed keyword, (13 more...)

arXiv.org Artificial Intelligence

2407.14085

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(8 more...)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
