Information Retrieval
A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream
Zafarani-Moattar, Elnaz, Kangavari, Mohammad Reza, Rahmani, Amir Masoud
Topic detection is a complex process and depends on language because it somehow needs to analyze text. There have been few studies on topic detection in Persian, and the existing algorithms are not remarkable. Therefore, we aimed to study topic detection in Persian. The objectives of this study are: 1) to conduct an extensive study on the best algorithms for topic detection, 2) to identify necessary adaptations to make these algorithms suitable for the Persian language, and 3) to evaluate their performance on Persian social network texts. To achieve these objectives, we have formulated two research questions: First, considering the lack of research in Persian, what modifications should be made to existing frameworks, especially those developed in English, to make them compatible with Persian? Second, how do these algorithms perform, and which one is superior? There are various topic detection methods that can be categorized into different categories. Frequent pattern and clustering are selected for this research, and a hybrid of both is proposed as a new category. Then, ten methods from these three categories are selected. All of them are re-implemented from scratch, changed, and adapted with Persian. These ten methods encompass different types of topic detection methods and have shown good performance in English. The text of Persian social network posts is used as the dataset. Additionally, a new multiclass evaluation criterion, called FS, is used in this paper for the first time in the field of topic detection. Approximately 1.4 billion tokens are processed during experiments. The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better. However, if the aim is to cluster posts for further analysis, the frequent pattern category is more suitable.
Query Complexity of Derivative-Free Optimization
This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same.
Accuracy at the Top
We introduce a new notion of classification accuracy based on the top -quantile values of a scoring function, a relevant criterion in a number of problems arising for search engines. We define an algorithm optimizing a convex surrogate of the corresponding loss, and discuss its solution in terms of a set of convex optimization problems. We also present margin-based guarantees for this algorithm based on the top -quantile value of the scores of the functions in the hypothesis set. Finally, we report the results of several experiments in the bipartite setting evaluating the performance of our solution and comparing the results to several other algorithms seeking high precision at the top. In most examples, our solution achieves a better performance in precision at the top.
Human memory search as a random walk in a semantic network
The human mind has a remarkable ability to store a vast amount of information in memory, and an even more remarkable ability to retrieve these experiences when needed. Understanding the representations and algorithms that underlie human memory search could potentially be useful in other information retrieval settings, including internet search. Psychological studies have revealed clear regularities in how people search their memory, with clusters of semantically related items tending to be retrieved together. These findings have recently been taken as evidence that human memory search is similar to animals foraging for food in patchy environments, with people making a rational decision to switch away from a cluster of related information as it becomes depleted. We demonstrate that the results that were taken as evidence for this account also emerge from a random walk on a semantic network, much like the random web surfer model used in internet search engines. This offers a simpler and more unified account of how people search their memory, postulating a single process rather than one process for exploring a cluster and one process for switching between clusters.
Contextual Clarity: Generating Sentences with Transformer Models using Context-Reverso Data
To create a dataset for training the T5 model, we harness the power of that provides usage examples for words. We prepared a dataset in the form of (query word, context or example usage) by parsing Context-Reverso webpages based on a query word. Additionally, we trained t5-small, and t5-base models for generating context-sentences based on input words. This resource enables us to obtain diverse and contextually rich sentences that incorporate the target keywords. We have also developed an application for learning new English words with a generated context [Telegram bot]. Our method aims to address the challenges of generating extremely short contexts and mitigating ambiguity in sentence construction. Objective: To develop a model that can generate informative and contextually relevant sentence-contexts for a given set of keywords, benefiting natural language understanding and generation applications such as search engines, personal assistants, and content summarization.
MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection
Li, Yupeng, He, Haorui, Bai, Jin, Wen, Dacheng
The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.
Copeland Dueling Bandits Zohar Karnin Informatics Institute
A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results.
Science Checker Reloaded: A Bidirectional Paradigm for Transparency and Logical Reasoning
Rakotoson, Loïc, Massip, Sylvain, Laleye, Fréjus A. A.
Information retrieval is a rapidly evolving field. However it still faces significant limitations in the scientific and industrial vast amounts of information, such as semantic divergence and vocabulary gaps in sparse retrieval, low precision and lack of interpretability in semantic search, or hallucination and outdated information in generative models. In this paper, we introduce a two-block approach to tackle these hurdles for long documents. The first block enhances language understanding in sparse retrieval by query expansion to retrieve relevant documents. The second block deepens the result by providing comprehensive and informative answers to the complex question using only the information spread in the long document, enabling bidirectional engagement. At various stages of the pipeline, intermediate results are presented to users to facilitate understanding of the system's reasoning. We believe this bidirectional approach brings significant advancements in terms of transparency, logical thinking, and comprehensive understanding in the field of scientific information retrieval.
Foundation Models and Information Retrieval in Digital Pathology
The surge in adoption of digital pathology has the potential to revolutionize medical diagnosis by allowing computerized analysis of tissue images (Pantanowitz 2010; Aljanabi 2012; Hanna2020). Central to this technology is the digitization of formalin-fixed, paraffin-embedded (FFPE) tissue sections mounted on glass slides. This process converts physical tissue samples into high-resolution, gigapixel digital images called whole slide images (WSIs) (Kumar2020; Evans2022). These WSI files contain detailed patterns of tissue morphology, enabling the application of computer-vision algorithms in diagnostic pathology. Pathologists can now analyze tissue images seamlessly on computer screens at various magnifications (Griffin2017). This shift from light microscopes to digital displays allows for easier visual inspection of anatomic clues that may indicate specific diseases.