AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling

Cheng, Zifeng, Zhou, Qingyu, Jiang, Zhiwei, Zhao, Xuemin, Cao, Yunbo, Gu, Qing

arXiv.org Artificial IntelligenceJul-19-2023

Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples. Existing methods solve the data scarcity problem mainly by designing token-level or span-level labeling models based on metric learning. However, these methods are only trained at a single granularity (i.e., either token level or span level) and have some weaknesses of the corresponding granularity. In this paper, we first unify token and span level supervisions and propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling. CDAP contains the token-level and span-level networks, jointly trained at different granularities. To align the outputs of two networks, we further propose a consistent loss to enable them to learn from each other. During the inference phase, we propose a consistent greedy inference algorithm that first adjusts the predicted probability and then greedily selects non-overlapping spans with maximum probability. Extensive experiments show that our model achieves new state-of-the-art results on three benchmark datasets.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.07946

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Monaco > Monaco (0.04)
(7 more...)

Genre:

Overview (0.67)
Research Report (0.50)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Persian topic detection based on Human Word association and graph embedding

Ranjbar-Khadivi, Mehrdad, Akbarpour, Shahin, Feizi-Derakhshi, Mohammad-Reza, Anari, Babak

arXiv.org Artificial IntelligenceJul-18-2023

In this paper, we propose a framework to detect topics in social media based on Human Word Association. Identifying topics discussed in these media has become a critical and significant challenge. Most of the work done in this area is in English, but much has been done in the Persian language, especially microblogs written in Persian. Also, the existing works focused more on exploring frequent patterns or semantic relationships and ignored the structural methods of language. In this paper, a topic detection framework using HWA, a method for Human Word Association, is proposed. This method uses the concept of imitation of mental ability for word association. This method also calculates the Associative Gravity Force that shows how words are related. Using this parameter, a graph can be generated. The topics can be extracted by embedding this graph and using clustering methods. This approach has been applied to a Persian language dataset collected from Telegram. Several experimental studies have been performed to evaluate the proposed framework's performance. Experimental results show that this approach works better than other topic detection methods.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2302.09775

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > Middle East > Iran > East Azerbaijan Province > Tabriz (0.04)
Asia > Azerbaijan (0.04)
(3 more...)

Genre: Research Report > New Finding (0.54)

Industry: Information Technology (0.48)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

A Human Word Association based model for topic detection in social networks

Khadivi, Mehrdad Ranjbar, Akbarpour, Shahin, Feizi-Derakhshi, Mohammad-Reza, Anari, Babak

arXiv.org Artificial IntelligenceJul-18-2023

With the widespread use of social networks, detecting the topics discussed in these networks has become a significant challenge. The current works are mainly based on frequent pattern mining or semantic relations, and the language structure is not considered. The meaning of language structural methods is to discover the relationship between words and how humans understand them. Therefore, this paper uses the Concept of the Imitation of the Mental Ability of Word Association to propose a topic detection framework in social networks. This framework is based on the Human Word Association method. A special extraction algorithm has also been designed for this purpose. The performance of this method is evaluated on the FA-CUP dataset. It is a benchmark dataset in the field of topic detection. The results show that the proposed method is a good improvement compared to other methods, based on the Topic-recall and the keyword F1 measure. Also, most of the previous works in the field of topic detection are limited to the English language, and the Persian language, especially microblogs written in this language, is considered a low-resource language. Therefore, a data set of Telegram posts in the Farsi language has been collected. Applying the proposed method to this dataset also shows that this method works better than other topic detection methods.

information retrieval, machine learning, pattern recognition, (21 more...)

arXiv.org Artificial Intelligence

2301.13066

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Iran > East Azerbaijan Province > Tabriz (0.04)
Asia > Azerbaijan (0.04)
(3 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Services (1.00)
Government > Regional Government > North America Government > United States Government (0.46)
Leisure & Entertainment > Sports > Soccer (0.35)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

Cross Modal Data Discovery over Structured and Unstructured Data Lakes

Eltabakh, Mohamed Y., Kunjir, Mayuresh, Elmagarmid, Ahmed, Ahmad, Mohammad Shahmeer

arXiv.org Artificial IntelligenceJul-16-2023

Organizations are collecting increasingly large amounts of data for data driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Perversely, such mixture of datasets makes the problem of discovering elements (e.g., tables or documents) that are relevant to a user's query or an analytical task very challenging. Despite the recent efforts in data discovery, the problem remains widely open especially in the two fronts of (1) discovering relationships and relatedness across structured and unstructured datasets where existing techniques suffer from either scalability, being customized for a specific problem type (e.g., entity matching or data integration), or demolishing the structural properties on its way, and (2) developing a holistic system for integrating various similarity measurements and sketches in an effective way to boost the discovery accuracy. In this paper, we propose a new data discovery system, named CMDL, for addressing these two limitations. CMDL supports the data discovery process over both structured and unstructured data while retaining the structural properties of tables.

data mining, discovery, machine learning, (27 more...)

arXiv.org Artificial Intelligence

2306.00932

Country:

Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)
Asia > Middle East > Qatar (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Information Technology > Services (0.67)
Health & Medicine > Therapeutic Area > Oncology (0.46)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(3 more...)

Add feedback

Pre-trained Language Models in Biomedical Domain: A Systematic Survey

Wang, Benyou, Xie, Qianqian, Pei, Jiahuan, Chen, Zhihong, Tiwari, Prayag, Li, Zhao, fu, Jie

arXiv.org Artificial IntelligenceJul-16-2023

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing (NLP) tasks. This also benefits biomedical domain: researchers from informatics, medicine, and computer science (CS) communities propose various PLMs trained on biomedical datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences for various biomedical tasks. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussions. It expects a survey that not only systematically reviews recent advances of biomedical PLMs and their applications but also standardizes terminology and benchmarks. In this paper, we summarize the recent progress of pre-trained language models in the biomedical domain and their applications in biomedical downstream tasks. Particularly, we discuss the motivations and propose a taxonomy of existing biomedical PLMs. Their applications in biomedical downstream tasks are exhaustively discussed. At last, we illustrate various limitations and future trends, which we hope can provide inspiration for the future research of the research community.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2110.05006

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)
Asia > Middle East > Israel (0.04)
(11 more...)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.92)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(4 more...)

Add feedback

Intuitive Access to Smartphone Settings Using Relevance Model Trained by Contrastive Learning

Kim, Joonyoung, Lee, Kangwook, Shin, Haebin, Lee, Hurnjoo, Kang, Sechun, Choi, Byunguk, Shin, Dong, Lee, Joohyung

arXiv.org Artificial IntelligenceJul-15-2023

The more new features that are being added to smartphones, the harder it becomes for users to find them. This is because the feature names are usually short, and there are just too many to remember. In such a case, the users may want to ask contextual queries that describe the features they are looking for, but the standard term frequency-based search cannot process them. This paper presents a novel retrieval system for mobile features that accepts intuitive and contextual search queries. We trained a relevance model via contrastive learning from a pre-trained language model to perceive the contextual relevance between query embeddings and indexed mobile features. Also, to make it run efficiently on-device using minimal resources, we applied knowledge distillation to compress the model without degrading much performance. To verify the feasibility of our method, we collected test queries and conducted comparative experiments with the currently deployed search baselines. The results show that our system outperforms the others on contextual sentence queries and even on usual keyword-based queries.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.09177

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.85)

Industry: Education (0.47)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)

Add feedback

QontSum: On Contrasting Salient Content for Query-focused Summarization

Sotudeh, Sajad, Goharian, Nazli

arXiv.org Artificial IntelligenceJul-14-2023

Query-focused summarization (QFS) is a challenging task in natural language processing that generates summaries to address specific queries. The broader field of Generative Information Retrieval (Gen-IR) aims to revolutionize information extraction from vast document corpora through generative approaches, encompassing Generative Document Retrieval (GDR) and Grounded Answer Retrieval (GAR). This paper highlights the role of QFS in Grounded Answer Generation (GAR), a key subdomain of Gen-IR that produces human-readable answers in direct correspondence with queries, grounded in relevant documents. In this study, we propose QontSum, a novel approach for QFS that leverages contrastive learning to help the model attend to the most relevant regions of the input document. We evaluate our approach on a couple of benchmark datasets for QFS and demonstrate that it either outperforms existing state-of-the-art or exhibits a comparable performance with considerably reduced computational cost through enhancements in the fine-tuning stage, rather than relying on large-scale pre-training experiments, which is the focus of current SOTA. Moreover, we conducted a human study and identified improvements in the relevance of generated summaries to the posed queries without compromising fluency. We further conduct an error analysis study to understand our model's limitations and propose avenues for future research.

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2307.07586

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(5 more...)

Genre:

Research Report > Promising Solution (0.88)
Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)

Add feedback

Hybrid moderation in the newsroom: Recommending featured posts to content moderators

Waterschoot, Cedric, Bosch, Antal van den

arXiv.org Artificial IntelligenceJul-14-2023

Online news outlets are grappling with the moderation of user-generated content within their comment section. We present a recommender system based on ranking class probabilities to support and empower the moderator in choosing featured posts, a time-consuming task. By combining user and textual content features we obtain an optimal classification F1-score of 0.44 on the test set. Furthermore, we observe an optimum mean NDCG@5 of 0.87 on a large set of validation articles. As an expert evaluation, content moderators assessed the output of a random selection of articles by choosing comments to feature based on the recommendations, which resulted in a NDCG score of 0.83. We conclude that first, adding text features yields the best score and second, while choosing featured content remains somewhat subjective, content moderators found suitable comments in all but one evaluated recommendations. We end the paper by analyzing our best-performing model, a step towards transparency and explainability in hybrid content moderation.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.07317

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.34)

Add feedback

Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

Golchin, Shahriar, Surdeanu, Mihai, Tavabi, Nazgol, Kiapour, Ata

arXiv.org Artificial IntelligenceJul-14-2023

We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.0716

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Arizona > Pima County > Tucson (0.14)
North America > United States > Oregon (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.69)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.71)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

Testing different Log Bases For Vector Model Weighting Technique

Assaf, Kamel

arXiv.org Artificial IntelligenceJul-12-2023

Information retrieval systems retrieves relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we are going to test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for vector model weighting technique is to highlight the importance of understanding the performance of the system at different weighting values. We use the documents of MED, CRAN, NPL, LISA, and CISI test collections that scientists assembled explicitly for experiments in data information retrieval systems.

information retrieval, natural language, test collection, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.5121/ijnlc.2023.12301

2307.06213

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback