AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Ta'keed: The First Generative Fact-Checking System for Arabic Claims

Althabiti, Saud, Alsalka, Mohammad Ammar, Atwell, Eric

arXiv.org Artificial IntelligenceJan-25-2024

This paper introduces Ta'keed, an explainable Arabic automatic fact-checking system. While existing research often focuses on classifying claims as "True" or "False," there is a limited exploration of generating explanations for claim credibility, particularly in Arabic. Ta'keed addresses this gap by assessing claim truthfulness based on retrieved snippets, utilizing two main components: information retrieval and LLM-based claim verification. We compiled the ArFactEx, a testing gold-labelled dataset with manually justified references, to evaluate the system. The initial model achieved a promising F1 score of 0.72 in the classification task. Meanwhile, the system's generated explanations are compared with gold-standard explanations syntactically and semantically. The study recommends evaluating using semantic similarities, resulting in an average cosine similarity score of 0.76. Additionally, we explored the impact of varying snippet quantities on claim classification accuracy, revealing a potential correlation, with the model using the top seven hits outperforming others with an F1 score of 0.77.

dataset, explanation, information, (15 more...)

arXiv.org Artificial Intelligence

2401.14067

Country:

Europe > United Kingdom > England > West Yorkshire > Leeds (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Middle East > Saudi Arabia > Mecca Province > Jeddah (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry:

Media > News (1.00)
Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)

Add feedback

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Wu, Siwei, Li, Yizhi, Zhu, Kang, Zhang, Ge, Liang, Yiming, Ma, Kaijing, Xiao, Chenghao, Zhang, Haoran, Yang, Bohao, Chen, Wenhu, Huang, Wenhao, Moubayed, Noura Al, Fu, Jie, Lin, Chenghua

arXiv.org Artificial IntelligenceJan-24-2024

Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.

dataset, representation, subcategory, (15 more...)

arXiv.org Artificial Intelligence

2401.13478

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Add feedback

DREditor: An Time-efficient Approach for Building a Domain-specific Dense Retrieval Model

Huang, Chen, Feng, Duanyu, Lei, Wenqiang, Lv, Jiancheng

arXiv.org Artificial IntelligenceJan-23-2024

Deploying dense retrieval models efficiently is becoming increasingly important across various industries. This is especially true for enterprise search services, where customizing search engines to meet the time demands of different enterprises in different domains is crucial. Motivated by this, we develop a time-efficient approach called DREditor to edit the matching rule of an off-the-shelf dense retrieval model to suit a specific domain. This is achieved by directly calibrating the output embeddings of the model using an efficient and effective linear mapping. This mapping is powered by an edit operator that is obtained by solving a specially constructed least squares problem. Compared to implicit rule modification via long-time finetuning, our experimental results show that DREditor provides significant advantages on different domain-specific datasets, dataset sources, retrieval models, and computing devices. It consistently enhances time efficiency by 100-300 times while maintaining comparable or even superior retrieval performance. In a broader context, we take the first step to introduce a novel embedding calibration approach for the retrieval task, filling the technical blank in the current field of embedding calibration. This approach also paves the way for building domain-specific dense retrieval models efficiently and inexpensively.

dataset, dr model, dreditor, (14 more...)

arXiv.org Artificial Intelligence

2401.1254

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.93)
(3 more...)

Add feedback

Comparing Human-Centered Language Modeling: Is it Better to Model Groups, Individual Traits, or Both?

Soni, Nikita, Balasubramanian, Niranjan, Schwartz, H. Andrew, Hovy, Dirk

arXiv.org Artificial IntelligenceJan-23-2024

Natural language processing has made progress in incorporating human context into its models, but whether it is more effective to use group-wise attributes (e.g., over-45-year-olds) or model individuals remains open. Group attributes are technically easier but coarse: not all 45-year-olds write the same way. In contrast, modeling individuals captures the complexity of each person's identity. It allows for a more personalized representation, but we may have to model an infinite number of users and require data that may be impossible to get. We compare modeling human context via group attributes, individual users, and combined approaches. Combining group and individual features significantly benefits user-level regression tasks like age estimation or personality assessment from a user's documents. Modeling individual users significantly improves the performance of single document-level classification tasks like stance and topic detection. We also find that individual-user modeling does well even without user's historical data.

computational linguistic, human context, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2401.12492

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > United States > Maryland > Baltimore (0.04)
(12 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.54)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Cognitive Science > Simulation of Human Behavior (0.48)

Add feedback

An Empirical Study on Compliance with Ranking Transparency in the Software Documentation of EU Online Platforms

Sovrano, Francesco, Lognoul, Michaël, Bacchelli, Alberto

arXiv.org Artificial IntelligenceJan-23-2024

Compliance with the European Union's Platform-to-Business (P2B) Regulation is challenging for online platforms, and assessing their compliance can be difficult for public authorities. This is partly due to the lack of automated tools for assessing the information (e.g., software documentation) platforms provide concerning ranking transparency. Our study tackles this issue in two ways. First, we empirically evaluate the compliance of six major platforms (Amazon, Bing, Booking, Google, Tripadvisor, and Yahoo), revealing substantial differences in their documentation. Second, we introduce and test automated compliance assessment tools based on ChatGPT and information retrieval technology. These tools are evaluated against human judgments, showing promising results as reliable proxies for compliance assessments. Our findings could help enhance regulatory compliance and align with the United Nations Sustainable Development Goal 10.3, which seeks to reduce inequality, including business disparities, on these platforms.

documentation, explanation, platform, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3639475.3640112

2312.14794

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Portugal > Lisbon > Lisbon (0.05)
Europe > Belgium > Wallonia > Namur Province > Namur (0.04)
(8 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Industry:

Law (1.00)
Government > Regional Government > Europe Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.39)

Add feedback

LearnedWMP: Workload Memory Prediction Using Distribution of Query Templates

Quader, Shaikh, Jaramillo, Andres, Mukhopadhyay, Sumona, Abuoda, Ghadeer, Zuzarte, Calisto, Kalmuk, David, Litoiu, Marin, Papagelis, Manos

arXiv.org Artificial IntelligenceJan-22-2024

In a modern DBMS, working memory is frequently the limiting factor when processing in-memory analytic query operations such as joins, sorting, and aggregation. Existing resource estimation approaches for a DBMS estimate the resource consumption of a query by computing an estimate of each individual database operator in the query execution plan. Such an approach is slow and error-prone as it relies upon simplifying assumptions, such as uniformity and independence of the underlying data. Additionally, the existing approach focuses on individual queries separately and does not factor in other queries in the workload that may be executed concurrently. In this research, we are interested in query performance optimization under concurrent execution of a batch of queries (a workload). Specifically, we focus on predicting the memory demand for a workload rather than providing separate estimates for each query within it. We introduce the problem of workload memory prediction and formalize it as a distribution regression problem. We propose Learned Workload Memory Prediction (LearnedWMP) to improve and simplify estimating the working memory demands of workloads. Through a comprehensive experimental evaluation, we show that LearnedWMP reduces the memory estimation error of the state-of-the-practice method by up to 47.6%. Compared to an alternative single-query model, during training and inferencing, the LearnedWMP model and its variants were 3x to 10x faster. Moreover, LearnedWMP-based models were at least 50% smaller in most cases. Overall, the results demonstrate the advantages of the LearnedWMP approach and its potential for a broader impact on query performance optimization.

query, template, workload, (15 more...)

arXiv.org Artificial Intelligence

2401.12103

Country:

North America > Canada (0.05)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine (0.56)

Technology:

Information Technology > Databases (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

Personality Trait Inference Via Mobile Phone Sensors: A Machine Learning Approach

Sze, Wun Yung Shaney, Herrero, Maryglen Pearl, Garriga, Roger

arXiv.org Artificial IntelligenceJan-22-2024

This study provides evidence that personality can be reliably predicted from activity data collected through mobile phone sensors. Employing a set of well informed indicators calculable from accelerometer records and movement patterns, we were able to predict users' personality up to a 0.78 F1 score on a two class problem. Given the fast growing number of data collected from mobile phones, our novel personality indicators open the door to exciting avenues for future research in social sciences. Our results reveal distinct behavioral patterns that proved to be differentially predictive of big five personality traits. They potentially enable cost effective, questionnaire free investigation of personality related questions at an unprecedented scale. We show how a combination of rich behavioral data obtained with smartphone sensing and the use of machine learning techniques can help to advance personality research and can inform both practitioners and researchers about the different behavioral patterns of personality. These findings have practical implications for organizations harnessing mobile sensor data for personality assessment, guiding the refinement of more precise and efficient prediction models in the future.

extraversion, personality trait, prediction, (16 more...)

arXiv.org Artificial Intelligence

2401.10305

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
North America > United States > District of Columbia > Washington (0.04)
Europe > United Kingdom (0.04)
Asia > India > Bihar > Patna (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.30)

Add feedback

Navigating the Thin Line: Examining User Behavior in Search to Detect Engagement and Backfire Effects

Cau, F. M., Tintarev, N.

arXiv.org Artificial IntelligenceJan-20-2024

Opinionated users often seek information that aligns with their preexisting beliefs while dismissing contradictory evidence due to confirmation bias. This conduct hinders their ability to consider alternative stances when searching the web. Despite this, few studies have analyzed how the diversification of search results on disputed topics influences the search behavior of highly opinionated users. To this end, we present a preregistered user study (n = 257) investigating whether different levels (low and high) of bias metrics and search results presentation (with or without AI-predicted stances labels) can affect the stance diversity consumption and search behavior of opinionated users on three debated topics (i.e., atheism, intellectual property rights, and school uniforms). Our results show that exposing participants to (counter-attitudinally) biased search results increases their consumption of attitude-opposing content, but we also found that bias was associated with a trend toward overall fewer interactions within the search page. We also found that 19% of users interacted with queries and search pages but did not select any search results. When we removed these participants in a post-hoc analysis, we found that stance labels increased the diversity of stances consumed by users, particularly when the search results were biased. Our findings highlight the need for future research to explore distinct search scenario settings to gain insight into opinionated users' behavior.

diversity, participant, search result, (16 more...)

arXiv.org Artificial Intelligence

2401.11201

Country:

North America > United States > New York > New York County > New York City (0.06)
Europe > Netherlands > Limburg > Maastricht (0.04)
Europe > Switzerland (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Education (0.67)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.30)

Add feedback

An Information Retrieval and Extraction Tool for Covid-19 Related Papers

Pivetta, Marcos V. L.

arXiv.org Artificial IntelligenceJan-19-2024

Background: The COVID-19 pandemic has caused severe impacts on health systems worldwide. Its critical nature and the increased interest of individuals and organizations to develop countermeasures to the problem has led to a surge of new studies in scientific journals. Objetive: We sought to develop a tool that incorporates, in a novel way, aspects of Information Retrieval (IR) and Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this paper is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and hightlight relevant entities in text. Method: We applied Latent Dirichlet Allocation (LDA) to model, based on research aspects, the topics of all English abstracts in CORD-19. Relevant named entities of each abstract were extracted and linked to the corresponding UMLS concept. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers. Results: Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we identified that more fine-tuned topic modeling parameters and increased accuracy of the research aspect classifier model could lead to a more accurate and reliable tool. Conclusion: We emphasize the need of new automated tools to help researchers find relevant COVID-19 documents, in addition to automatically extracting useful information contained in them. Our work suggests that combining different algorithms and models could lead to new ways of browsing COVID-19 paper data.

anuary 31, cord-19, research aspect, (15 more...)

arXiv.org Artificial Intelligence

2401.1643

Country:

South America > Brazil > Rio Grande do Sul (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.89)

Add feedback

Efficient approximation of Earth Mover's Distance Based on Nearest Neighbor Search

Meng, Guangyu, Zhou, Ruyu, Liu, Liu, Liang, Peixian, Liu, Fang, Chen, Danny, Niemier, Michael, Hu, X. Sharon

arXiv.org Artificial IntelligenceJan-19-2024

Earth Mover's Distance (EMD) is an important similarity measure between two distributions, used in computer vision and many other application domains. However, its exact calculation is computationally and memory intensive, which hinders its scalability and applicability for large-scale problems. Various approximate EMD algorithms have been proposed to reduce computational costs, but they suffer lower accuracy and may require additional memory usage or manual parameter tuning. In this paper, we present a novel approach, NNS-EMD, to approximate EMD using Nearest Neighbor Search (NNS), in order to achieve high accuracy, low time complexity, and high memory efficiency. The NNS operation reduces the number of data points compared in each NNS iteration and offers opportunities for parallel processing. We further accelerate NNS-EMD via vectorization on GPU, which is especially beneficial for large datasets. We compare NNS-EMD with both the exact EMD and state-of-the-art approximate EMD algorithms on image classification and retrieval tasks. We also apply NNS-EMD to calculate transport mapping and realize color transfer between images. NNS-EMD can be 44x to 135x faster than the exact EMD implementation, and achieves superior accuracy, speedup, and memory efficiency over existing approximate EMD methods.

algorithm, dataset, nn-emd, (14 more...)

arXiv.org Artificial Intelligence

2401.07378

Country:

Europe > Belgium (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
(2 more...)

Add feedback