AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Dynamic Algorithms for Matroid Submodular Maximization

Banihashem, Kiarash, Biabani, Leyla, Goudarzi, Samira, Hajiaghayi, MohammadTaghi, Jabbarzade, Peyman, Monemizadeh, Morteza

arXiv.org Artificial IntelligenceDec-26-2023

Submodular maximization under matroid and cardinality constraints are classical problems with a wide range of applications in machine learning, auction theory, and combinatorial optimization. In this paper, we consider these problems in the dynamic setting, where (1) we have oracle access to a monotone submodular function $f: 2^{V} \rightarrow \mathbb{R}^+$ and (2) we are given a sequence $\mathcal{S}$ of insertions and deletions of elements of an underlying ground set $V$. We develop the first fully dynamic $(4+\epsilon)$-approximation algorithm for the submodular maximization problem under the matroid constraint using an expected worst-case $O(k\log(k)\log^3{(k/\epsilon)})$ query complexity where $0 < \epsilon \le 1$. This resolves an open problem of Chen and Peng (STOC'22) and Lattanzi et al. (NeurIPS'20). As a byproduct, for the submodular maximization under the cardinality constraint $k$, we propose a parameterized (by the cardinality constraint $k$) dynamic algorithm that maintains a $(2+\epsilon)$-approximate solution of the sequence $\mathcal{S}$ at any time $t$ using an expected worst-case query complexity $O(k\epsilon^{-1}\log^2(k))$. This is the first dynamic algorithm for the problem that has a query complexity independent of the size of ground set $V$.

algorithm, invariant hold, matroidconstructlevel, (14 more...)

arXiv.org Artificial Intelligence

2306.00959

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)
(45 more...)

Genre: Research Report (0.49)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.75)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.48)

Add feedback

Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine

Kaneda, Kanta, Nagashima, Shunya, Korekata, Ryosuke, Kambara, Motonari, Sugiura, Komei

arXiv.org Artificial IntelligenceDec-25-2023

Domestic service robots offer a solution to the increasing demand for daily care and support. A human-in-the-loop approach that combines automation and operator intervention is considered to be a realistic approach to their use in society. Therefore, we focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting, which we define as the learning-to-rank physical objects (LTRPO) task. For example, given the instruction "Please go to the dining room which has a round table. Pick up the bottle on it," the model is required to output a ranked list of target objects that the operator/user can select. In this paper, we propose MultiRankIt, which is a novel approach for the LTRPO task. MultiRankIt introduces the Crossmodal Noun Phrase Encoder to model the relationship between phrases that contain referring expressions and the target bounding box, and the Crossmodal Region Feature Encoder to model the relationship between the target object and multiple images of its surrounding contextual environment. Additionally, we built a new dataset for the LTRPO task that consists of instructions with complex referring expressions accompanied by real indoor environmental images that feature various target objects. We validated our model on the dataset and it outperformed the baseline method in terms of the mean reciprocal rank and recall@k. Furthermore, we conducted physical experiments in a setting where a domestic service robot retrieved everyday objects in a standardized domestic environment, based on users' instruction in a human--in--the--loop setting. The experimental results demonstrate that the success rate for object retrieval achieved 80%. Our code is available at https://github.com/keio-smilab23/MultiRankIt.

dataset, instruction, ltrpo task, (15 more...)

arXiv.org Artificial Intelligence

2312.15844

Country: Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.50)

Add feedback

Data Bias Management

Communications of the ACMDec-22-2023, 05:45:20 GMT

The presence of bias in data has led to a lot of research being conducted to understand the impact of bias on machine learning (ML) models and data-driven decision-making systems.2 Research has focused on questions related to the fairness in the decisions taken by models trained with biased data, and on designing methods to increase the transparency of automated decision-making processes so that possible bias issues may be easily spotted and "fixed" by removing bias. Recent approaches taken in the literature to deal with data bias first aim to understand the cause of the problem (for example, a subset of the population being underrepresented in the training dataset) and then propose and evaluate an ad hoc intervention to reduce or remove the bias from the system (for example, by selecting which additional training data items to label in order to rebalance the dataset and increase equality--that is, a balanced representation of classes--rather than equity--that is, overrepresenting the disadvantaged subset of the population. Example research on bias removal include work looking at how to remove bias from learned word embeddings. Bolukbasi et al.3 defined a methodology for modifying an embedding representation to remove gender stereotypes.

dataset, end user, human annotator, (14 more...)

Communications of the ACM

Country:

Europe > Italy (0.05)
Oceania > Australia > Queensland (0.04)
North America > United States (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.30)

Add feedback

Semantic Parsing for Complex Data Retrieval: Targeting Query Plans vs. SQL for No-Code Access to Relational Databases

Eyal, Ben, Bachar, Amir, Haroche, Ophir, Elhadad, Michael

arXiv.org Artificial IntelligenceDec-22-2023

Large Language Models (LLMs) have spurred progress in text-to-SQL, the task of generating SQL queries from natural language questions based on a given database schema. Despite the declarative nature of SQL, it continues to be a complex programming language. In this paper, we investigate the potential of an alternative query language with simpler syntax and modular specification of complex queries. The purpose is to create a query language that can be learned more easily by modern neural semantic parsing architectures while also enabling non-programmers to better assess the validity of the query plans produced by an interactive query plan assistant. The proposed alternative query language is called Query Plan Language (QPL). It is designed to be modular and can be translated into a restricted form of SQL Common Table Expressions (CTEs). The aim of QPL is to make complex data retrieval accessible to non-programmers by allowing users to express their questions in natural language while also providing an easier-to-verify target language. The paper demonstrates how neural LLMs can benefit from QPL's modularity to generate complex query plans in a compositional manner. This involves a question decomposition strategy and a planning stage. We conduct experiments on a version of the Spider text-to-SQL dataset that has been converted to QPL. The hierarchical structure of QPL programs enables us to measure query complexity naturally. Based on this assessment, we identify the low accuracy of existing text-to-SQL systems on complex compositional queries. We present ways to address the challenge of complex queries in an iterative, user-controlled manner, using fine-tuned LLMs and a variety of prompting strategies in a compositional manner.

computational linguistic, query, sql query, (14 more...)

arXiv.org Artificial Intelligence

2312.14798

Country:

North America > United States > Washington > King County > Seattle (0.05)
North America > Dominican Republic (0.04)
North America > United States > New York > New York County > New York City (0.04)
(9 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

ROS package search for robot software development: a knowledge graph-based approach

Wang, Shuo, Mao, Xinjun, Yang, Shuo, Wu, Menghan, Zhang, Zhang

arXiv.org Artificial IntelligenceDec-22-2023

ROS (Robot Operating System) packages have become increasingly popular as a type of software artifact that can be effectively reused in robotic software development. Indeed, finding suitable ROS packages that closely match the software's functional requirements from the vast number of available packages is a nontrivial task using current search methods. The traditional search methods for ROS packages often involve inputting keywords related to robotic tasks into general-purpose search engines or code hosting platforms to obtain approximate results of all potentially suitable ROS packages. However, the accuracy of these search methods remains relatively low because the task-related keywords may not precisely match the functionalities offered by the ROS packages. To improve the search accuracy of ROS packages, this paper presents a novel semantic-based search approach that relies on the semantic-level ROS Package Knowledge Graph (RPKG) to automatically retrieve the most suitable ROS packages. Firstly, to construct the RPKG, we employ multi-dimensional feature extraction techniques to extract semantic concepts from the dataset of ROS package text descriptions. The semantic features extracted from this process result in a substantial number of entities and relationships. Subsequently, we create a robot domain-specific small corpus and further fine-tune a pre-trained language model, BERT-ROS, to generate embeddings that effectively represent the semantics of the extracted features. These embeddings play a crucial role in facilitating semantic-level understanding and comparisons during the ROS package search process within the RPKG. Secondly, we introduce a novel semantic matching-based search algorithm that incorporates the weighted similarities of multiple features from user search queries, which searches out more accurate ROS packages than the traditional keyword search method.

accuracy, ro package, ro package search, (12 more...)

arXiv.org Artificial Intelligence

2312.14781

Country:

South America > Chile (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

Multi-Agent Join

Ghadakchi, Vahid, Xie, Mian, Termehchy, Arash, Doskenov, Bakhtiyar, Srikhakollu, Bharghav, Haque, Summit, Wang, Huazheng

arXiv.org Artificial IntelligenceDec-21-2023

It is crucial to provide real-time performance in many applications, such as interactive and exploratory data analysis. In these settings, users often need to view subsets of query results quickly. It is challenging to deliver such results over large datasets for relational operators over multiple relations, such as join. Join algorithms usually spend a long time on scanning and attempting to join parts of relations that may not generate any result. Current solutions usually require lengthy and repeated preprocessing, which is costly and may not be possible to do in many settings. Also, they often support restricted types of joins. In this paper, we outline a novel approach for achieving efficient join processing in which a scan operator of the join learns during query execution, the portions of its relations that might satisfy the join predicate. We further improve this method using an algorithm in which both scan operators collaboratively learn an efficient join execution strategy. We also show that this approach generalizes traditional and non-learning methods for joining. Our extensive empirical studies using standard benchmarks indicate that this approach outperforms similar methods considerably.

algorithm, relation, tuple, (17 more...)

arXiv.org Artificial Intelligence

2312.14291

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Austria > Vienna (0.14)
(15 more...)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Databases (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Characterizing and Classifying Developer Forum Posts with their Intentions

Wu, Xingfang, Laufer, Eric, Li, Heng, Khomh, Foutse, Srinivasan, Santhosh, Luo, Jayden

arXiv.org Artificial IntelligenceDec-21-2023

With the rapid growth of the developer community, the amount of posts on online technical forums has been growing rapidly, which poses difficulties for users to filter useful posts and find important information. Tags provide a concise feature dimension for users to locate their interested posts and for search engines to index the most relevant posts according to the queries. However, most tags are only focused on the technical perspective (e.g., program language, platform, tool). In most cases, forum posts in online developer communities reveal the author's intentions to solve a problem, ask for advice, share information, etc. The modeling of the intentions of posts can provide an extra dimension to the current tag taxonomy. By referencing previous studies and learning from industrial perspectives, we create a refined taxonomy for the intentions of technical forum posts. Through manual labeling and analysis on a sampled post dataset extracted from online forums, we understand the relevance between the constitution of posts (code, error messages) and their intentions. Furthermore, inspired by our manual study, we design a pre-trained transformer-based model to automatically predict post intentions. The best variant of our intention prediction framework, which achieves a Micro F1-score of 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787, outperforms the state-of-the-art baseline approach. Our characterization and automated classification of forum posts regarding their intentions may help forum maintainers or third-party tool developers improve the organization and retrieval of posts on technical forums. We have released our annotated dataset and codes in our supplementary material package.

category, classifying developer forum post, intention, (14 more...)

arXiv.org Artificial Intelligence

2312.14279

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.93)

Add feedback

Hyperbolic Relevance Matching for Neural Keyphrase Extraction

Song, Mingyang, Feng, Yi, Jing, Liping

arXiv.org Artificial IntelligenceDec-21-2023

Keyphrase extraction is a fundamental task in natural language processing and information retrieval that aims to extract a set of phrases with important information from a source document. Identifying important keyphrase is the central component of the keyphrase extraction task, and its main challenge is how to represent information comprehensively and discriminate importance accurately. In this paper, to address these issues, we design a new hyperbolic matching model (HyperMatch) to represent phrases and documents in the same hyperbolic space and explicitly estimate the phrase-document relevance via the Poincar\'e distance as the important score of each phrase. Specifically, to capture the hierarchical syntactic and semantic structure information, HyperMatch takes advantage of the hidden representations in multiple layers of RoBERTa and integrates them as the word embeddings via an adaptive mixing layer. Meanwhile, considering the hierarchical structure hidden in the document, HyperMatch embeds both phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. This strategy can further enhance the estimation of phrase-document relevance due to the good properties of hyperbolic space. In this setting, the keyphrase extraction can be taken as a matching problem and effectively implemented by minimizing a hyperbolic margin-based triplet loss. Extensive experiments are conducted on six benchmarks and demonstrate that HyperMatch outperforms the state-of-the-art baselines.

hyperbolic space, keyphrase extraction, representation, (13 more...)

arXiv.org Artificial Intelligence

2205.02047

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Heilongjiang Province > Daqing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

dIR -- Discrete Information Retrieval: Conversational Search over Unstructured (and Structured) Data with Large Language Models

Bertorello, Pablo M. Rodriguez, Laguerre, Jean Rodmond Junior

arXiv.org Artificial IntelligenceDec-20-2023

Data is stored in both structured and unstructured form. Querying both, to power natural language conversations, is a challenge. This paper introduces dIR, Discrete Information Retrieval, providing a unified interface to query both free text and structured knowledge. Specifically, a Large Language Model (LLM) transforms text into expressive representation. After the text is extracted into columnar form, it can then be queried via a text-to-SQL Semantic Parser, with an LLM converting natural language into SQL. Where desired, such conversation may be effected by a multi-step reasoning conversational agent. We validate our approach via a proprietary question/answer data set, concluding that dIR makes a whole new class of queries on free text possible when compared to traditionally fine-tuned dense-embedding-model-based Information Retrieval (IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR can succeed where no other method stands a chance.

computational linguistic, information retrieval, query, (12 more...)

arXiv.org Artificial Intelligence

2312.13264

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > Canada > Ontario > Toronto (0.04)
(8 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

Detecting Technical Debt Using Natural Language Processing Approaches -- A Systematic Literature Review

Sutoyo, Edi, Capiluppi, Andrea

arXiv.org Artificial IntelligenceDec-19-2023

Context: Technical debt (TD) is a well-known metaphor for the long-term effects of architectural decisions in software development and the trade-off between producing high-quality, effective, and efficient code and meeting a release schedule. Thus, the code degrades and needs refactoring. A lack of resources, time, knowledge, or experience on the development team might cause TD in any software development project. Objective: In the context of TD detection, NLP has been utilized to identify the presence of TD automatically and even recognize specific types of TD. However, the enormous variety of feature extraction approaches and ML/DL algorithms employed in the literature often hinders researchers from trying to improve their performance. Method: In light of this, this SLR proposes a taxonomy of feature extraction techniques and algorithms used in technical debt detection: its objective is to compare and benchmark their performance in the examined studies. Results: We selected 55 articles that passed the quality evaluation of this SLR. We then investigated which feature extractions and algorithms were employed to identify TD in each SDLC phase. All approaches proposed in the analyzed studies were grouped into NLP, NLP+ML, and NLP+DL. This allows us to discuss the performance in three different ways. Conclusion: Overall, the NLP+DL group consistently outperforms in precision and F1-score for all projects, and in all but one project for the recall metric. Regarding the feature extraction techniques, the PTWE consistently achieves higher precision, recall, and F1-score for each project analyzed. Furthermore, TD types have been mapped, when possible, to SDLC phases: this served to determine the best-performing feature extractions and algorithms for each SDLC phase. Finally, based on the SLR results, we also identify implications that could be of concern to researchers and practitioners.

detection, f1-score, technical debt, (12 more...)

arXiv.org Artificial Intelligence

2312.1502

Country:

South America > Uruguay > Maldonado > Maldonado (0.05)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Europe > United Kingdom > England > Staffordshire (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback