 Information Retrieval


C-SEO Bench: Does Conversational SEO Work?

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is shifting toward Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested on only a narrow set of application domains; we do not know whether a given C-SEO method is effective across a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario in which one web document adopts a C-SEO method; in reality, multiple players are likely to adopt cutting-edge C-SEO techniques competitively, as happened with SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and numbers of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among the actors involved. Our experiments reveal that most current C-SEO methods are not only largely ineffective but also frequently have a negative impact on document ranking, the opposite of what is expected. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as the number of C-SEO adopters increases, the overall gains decrease, revealing the congested, zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.


Lean Finder: Semantic Search for Mathlib That Understands User Intents

arXiv.org Artificial Intelligence

We present Lean Finder, a semantic search engine for Lean and mathlib that understands and aligns with the intents of mathematicians. We further align Lean Finder with mathematicians' preferences using ... In addition, Lean Finder is compatible with LLM-based theorem provers, bridging retrieval with formal reasoning. Advances in Lean and mathlib (De Moura et al., 2015; Moura & Ullrich, 2021) are turning mathematical discovery into a collaborative and verifiable research workflow. Despite these advances, state-of-the-art LLMs still cannot solve math research problems. Lean's syntax, grammar, and tactics incur a steep learning curve. All experiments and data processing were conducted outside Meta. Figure 1: In the evaluation with user queries, real users preferred Lean Finder in 81.6% of cases, compared with ... Consider the two queries below. Lean search engines handle (Gao et al., 2024a;b; Ju & Dong, 2025; Asher, 2025): Denote L/K a field extension, x, y in L are algebraic elements over K with the same minimal polynomial. I'm working with algebraic elements over a field extension and I have two elements, say x and y in L. I know x is algebraic over K, and I've shown that y is a root of the minimal polynomial of x. Does this imply that the minimal polynomials of x and y are actually equal? Target Statement 2: theorem eq_of_root {x y : L} (hx : IsAlgebraic K x) (h_ev : Polynomial.aeval y (minpoly K x) = 0) : minpoly K y = minpoly K x := -- proof omitted for brevity. This user latent (motivation, perspective, abstraction) cannot be inferred or encoded by a purely syntactic informalization. Addressing this challenge calls for Lean search engines that can understand a mathematician's intent, not merely ... We defer a more rigorous analysis to Section 2.2, and ask our core question: ... Our approach analyzes and clusters public discussions, then synthesizes queries that simulate user intents (Section 3.1).


MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries

arXiv.org Artificial Intelligence

Although bilingual speakers frequently use mixed-language queries in web searches, Information Retrieval (IR) research on such queries remains scarce. To address this, we introduce MiLQ, a Mixed-Language Query test set and the first public benchmark of mixed-language queries, qualified as realistic and relatively preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, which also suggests that code-switched training data could help build robust IR models for such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.


Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings

arXiv.org Artificial Intelligence

The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future clinical events, etc. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.


Near-Optimal Quantum Algorithms for Computing (Coarse) Correlated Equilibria of General-Sum Games

arXiv.org Artificial Intelligence

Computing Nash equilibria of zero-sum games in classical and quantum settings has been extensively studied. For general-sum games, computing Nash equilibria is PPAD-hard, and the computation of a more general concept called correlated equilibria has been widely explored in game theory. In this paper, we initiate the study of quantum algorithms for computing $\varepsilon$-approximate correlated equilibria (CE) and coarse correlated equilibria (CCE) in multi-player normal-form games. Our approach utilizes quantum improvements to the multi-scale Multiplicative Weight Update (MWU) method for CE calculations, achieving a query complexity of $\tilde{O}(m\sqrt{n})$ for fixed $\varepsilon$. For CCE, we extend techniques from quantum algorithms for zero-sum games to multi-player settings, achieving query complexity $\tilde{O}(m\sqrt{n}/\varepsilon^{2.5})$. Both algorithms demonstrate a near-optimal scaling in the number of players $m$ and actions $n$, as confirmed by our quantum query lower bounds.
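
For orientation, the classical (non-quantum) link between multiplicative weights and coarse correlated equilibria can be sketched as follows. This is not the paper's quantum algorithm or its multi-scale MWU variant: it is plain Hedge run by both players of a two-player game with payoffs assumed to lie in [0, 1]. The function names `hedge_cce` and `cce_gap` and the example payoff matrices are illustrative assumptions.

```python
import math


def _reweight(dist, gains, eta):
    """One multiplicative-weights step: scale by exp(eta * gain), renormalize."""
    weights = [d * math.exp(eta * g) for d, g in zip(dist, gains)]
    total = sum(weights)
    return [w / total for w in weights]


def hedge_cce(u1, u2, rounds=2000):
    """Run Hedge for both players of a bimatrix game (classical sketch).

    u1[i][j] and u2[i][j] are payoffs in [0, 1] when player 1 plays i and
    player 2 plays j. Returns the time-averaged joint distribution of play,
    which is an approximate CCE when both players have low regret.
    """
    n_rows, n_cols = len(u1), len(u1[0])
    eta = math.sqrt(math.log(max(n_rows, n_cols)) / rounds)  # usual step size
    p = [1.0 / n_rows] * n_rows
    q = [1.0 / n_cols] * n_cols
    joint = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(rounds):
        for i in range(n_rows):
            for j in range(n_cols):
                joint[i][j] += p[i] * q[j] / rounds
        # Expected payoff of each pure action against the opponent's current mix.
        gain1 = [sum(u1[i][j] * q[j] for j in range(n_cols)) for i in range(n_rows)]
        gain2 = [sum(u2[i][j] * p[i] for i in range(n_rows)) for j in range(n_cols)]
        p = _reweight(p, gain1, eta)
        q = _reweight(q, gain2, eta)
    return joint


def cce_gap(u1, u2, joint):
    """Largest gain any player gets by committing to one fixed action."""
    n_rows, n_cols = len(u1), len(u1[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    value1 = sum(joint[i][j] * u1[i][j] for i, j in cells)
    value2 = sum(joint[i][j] * u2[i][j] for i, j in cells)
    row_marginal = [sum(joint[i]) for i in range(n_rows)]
    col_marginal = [sum(joint[i][j] for i in range(n_rows)) for j in range(n_cols)]
    best1 = max(sum(col_marginal[j] * u1[a][j] for j in range(n_cols))
                for a in range(n_rows))
    best2 = max(sum(row_marginal[i] * u2[i][b] for i in range(n_rows))
                for b in range(n_cols))
    return max(best1 - value1, best2 - value2)


# Hypothetical general-sum game with payoffs in [0, 1].
example_u1 = [[0.6, 0.2], [1.0, 0.0]]
example_u2 = [[0.6, 1.0], [0.2, 0.0]]
example_joint = hedge_cce(example_u1, example_u2)
print("CCE gap:", cce_gap(example_u1, example_u2, example_joint))
```

The gap returned by `cce_gap` equals the larger of the two players' average regrets, so it shrinks roughly as the square root of log(actions) / rounds under the standard Hedge analysis.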


Blending Learning to Rank and Dense Representations for Efficient and Effective Cascades

arXiv.org Artificial Intelligence

We investigate the exploitation of both lexical and neural relevance signals for ad-hoc passage retrieval. Our exploration involves a large-scale training dataset in which dense neural representations of MS-MARCO queries and passages are complemented and integrated with 253 hand-crafted lexical features extracted from the same corpus. Blending of the relevance signals from the two groups of features is learned by a classical Learning-to-Rank (LTR) model based on a forest of decision trees. To evaluate our solution, we employ a pipelined architecture in which a dense neural retriever serves as the first stage and performs a nearest-neighbor search over the neural representations of the documents. Our LTR model acts as the second stage, re-ranking the set of candidates retrieved by the first stage to enhance effectiveness. The results of reproducible experiments conducted with state-of-the-art dense retrievers on publicly available resources show that the proposed solution significantly enhances end-to-end ranking performance with only a small impact on efficiency. Specifically, we achieve a boost in nDCG@10 of up to 11% with an increase in average query latency of only 4.3%. This confirms the advantage of seamlessly combining two distinct families of signals that mutually contribute to retrieval effectiveness.
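
A toy version of the two-stage cascade and of nDCG@10 may help make the setup concrete. This is a sketch under assumptions, not the paper's implementation: the second stage here is a hand-written linear blend standing in for the learned forest of decision trees, the single lexical feature stands in for the 253 features, and all names and vectors are hypothetical. The DCG uses the linear-gain variant, which coincides with the exponential-gain variant for binary relevance.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_stage(query_vec, doc_vecs, k):
    """Dense retrieval: exhaustive nearest-neighbor search by inner product."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def second_stage(candidates, query_vec, doc_vecs, lexical_feats, scorer):
    """Re-rank first-stage candidates with a scorer that sees both signal types."""
    def blended(doc_id):
        return scorer(dot(query_vec, doc_vecs[doc_id]), lexical_feats[doc_id])
    return sorted(candidates, key=blended, reverse=True)


def ndcg_at_k(ranking, relevance, k=10):
    """nDCG@k with linear gain; relevance maps doc id -> graded label."""
    dcg = sum(relevance.get(d, 0) / math.log2(rank + 2)
              for rank, d in enumerate(ranking[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy data: 2-d "dense" vectors and one lexical feature per document.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
lexical_feats = {"d1": [0.0], "d2": [1.0], "d3": [0.5]}
query_vec = [1.0, 0.0]
relevance = {"d2": 1}


def linear_blend(dense_score, lexical):
    # Stand-in for the learned tree ensemble (assumption, not the paper's model).
    return dense_score + 0.5 * lexical[0]


candidates = first_stage(query_vec, doc_vecs, k=2)
reranked = second_stage(candidates, query_vec, doc_vecs, lexical_feats, linear_blend)
print(candidates, ndcg_at_k(candidates, relevance))
print(reranked, ndcg_at_k(reranked, relevance))
```

Because the second stage only scores the k candidates returned by the first stage, its cost is bounded by k rather than by the corpus size, which is why such a cascade can add effectiveness at a small latency cost.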


AI Guided Accelerator For Search Experience

arXiv.org Artificial Intelligence

Effective query reformulation is pivotal in narrowing the gap between a user's exploratory search behavior and the identification of relevant products in e-commerce environments. While traditional approaches predominantly model query rewrites as isolated pairs, they often fail to capture the sequential and transitional dynamics inherent in real-world user behavior. In this work, we propose a novel framework that explicitly models transitional queries--intermediate reformulations occurring during the user's journey toward their final purchase intent. By mining structured query trajectories from eBay's large-scale user interaction logs, we reconstruct query sequences that reflect shifts in intent while preserving semantic coherence. This approach allows us to model a user's shopping funnel, where mid-journey transitions reflect exploratory behavior and intent refinement. Furthermore, we incorporate generative Large Language Models (LLMs) to produce semantically diverse and intent-preserving alternative queries, extending beyond what can be derived through collaborative filtering alone. These reformulations can be leveraged to populate Related Searches or to power intent-clustered carousels on the search results page, enhancing both discovery and engagement. Our contributions include (i) the formal identification and modeling of transitional queries, (ii) the introduction of a structured query sequence mining pipeline for intent flow understanding, and (iii) the application of LLMs for scalable, intent-aware query expansion. Empirical evaluation demonstrates measurable gains in conversion and engagement metrics compared to the existing Related Searches module, validating the effectiveness of our approach in real-world e-commerce settings.
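
As a rough illustration of what mining transitional queries from session logs could look like, the sketch below groups the mid-journey queries of purchasing sessions by their opening query. The abstract does not specify the actual pipeline, so the session format, the `purchased` flag, the function names, and the example queries are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mine_transitional_queries(sessions):
    """Map each opening query to a Counter of mid-journey queries.

    Each session is a (queries, purchased) pair; only sessions that ended
    in a purchase are used, and the final query is treated as the purchase
    intent rather than as a transition.
    """
    by_start = defaultdict(Counter)
    for queries, purchased in sessions:
        if not purchased:
            continue
        # Collapse immediate repeats so re-submissions are not counted as rewrites.
        path = [q for i, q in enumerate(queries) if i == 0 or q != queries[i - 1]]
        if len(path) < 3:
            continue  # no intermediate step between the first and final query
        by_start[path[0]].update(path[1:-1])
    return by_start


def related_searches(by_start, query, top_n=3):
    """Most frequent transitional queries observed after `query`."""
    return [q for q, _ in by_start.get(query, Counter()).most_common(top_n)]


# Hypothetical session log.
sessions = [
    (["running shoes", "trail running shoes", "trail running shoes size 10"], True),
    (["running shoes", "trail running shoes", "waterproof trail shoes",
      "trail running shoes size 10"], True),
    (["running shoes", "running socks", "wool running socks"], False),
    (["running shoes", "trail running shoes size 10"], True),
]
by_start = mine_transitional_queries(sessions)
print(related_searches(by_start, "running shoes"))
```

Unlike isolated rewrite pairs, this keeps the position of each query within a full trajectory, so only reformulations that sit between the opening query and the purchase query are counted.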


MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

arXiv.org Artificial Intelligence

This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods, exploring different language models such as LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin


Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

arXiv.org Artificial Intelligence

We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches -- i.e., "rewriting history" -- and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.


LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding

arXiv.org Artificial Intelligence

Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes is essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treat this problem as a classification task and therefore ignore the order of the ICD codes, even though that order matters for several purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model's accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.
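
To show why set-based F1 and ranking accuracy measure different things, here is a small sketch that computes micro-F1, macro-F1, and primary-code accuracy on toy data. It is not the paper's evaluation code: the function names are hypothetical, the ICD-10 codes are used only as example labels, and macro-F1 is averaged over labels that occur in the gold or predicted sets, which is one of several conventions.

```python
from collections import Counter


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Set-based F1 over code lists; the order inside each list is ignored."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp.update(p_set & g_set)
        fp.update(p_set - g_set)
        fn.update(g_set - p_set)
    labels = set(tp) | set(fp) | set(fn)
    micro = _f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    per_label = [_f1(tp[c], fp[c], fn[c]) for c in labels]
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    return micro, macro


def primary_accuracy(gold, pred):
    """Fraction of notes whose top-ranked predicted code is the gold primary code."""
    hits = sum(1 for g, p in zip(gold, pred) if g and p and g[0] == p[0])
    return hits / len(gold) if gold else 0.0


# Hypothetical notes: each list is ordered, primary diagnosis first.
gold = [["I21", "I10", "E11"], ["J18", "I10"]]
pred = [["I10", "I21", "E11"], ["J18", "N39"]]

micro, macro = micro_macro_f1(gold, pred)
print(micro, macro, primary_accuracy(gold, pred))
```

In the first toy note every code is predicted, so the F1 contribution is perfect, yet the primary diagnosis is ranked second; only an order-aware metric such as `primary_accuracy` registers that error.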