Search
Locally-Adaptive Quantization for Streaming Vector Search
Aguerrebere, Cecilia, Hildebrand, Mark, Bhati, Ishwar Singh, Willke, Theodore, Tepper, Mariano
Retrieving the most similar vector embeddings to a given query among a massive collection of vectors has long been a key component of countless real-world applications. The recently introduced Retrieval-Augmented Generation is one of the most prominent examples. For many of these applications, the database evolves over time by inserting new data and removing outdated data. In these cases, the retrieval problem is known as streaming similarity search. While Locally-Adaptive Vector Quantization (LVQ), a highly efficient vector compression method, yields state-of-the-art search performance for non-evolving databases, its usefulness in the streaming setting has not yet been established. In this work, we study LVQ in streaming similarity search. In support of our evaluation, we introduce two improvements to LVQ, Turbo LVQ and multi-means LVQ, which boost its search performance by up to 28% and 27%, respectively. Our studies show that LVQ and its new variants enable blazing fast vector search, outperforming its closest competitor by up to 9.4x for identically distributed data and by up to 8.8x under the challenging scenario of data distribution shifts (i.e., where the statistical distribution of the data changes over time). We release our contributions as part of Scalable Vector Search, an open-source library for high-performance similarity search.
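The abstract above does not spell out how LVQ compresses a vector, so the following is only a minimal sketch under a stated assumption: that the basic scheme amounts to subtracting the dataset mean and then uniformly quantizing each vector with its own (local) minimum and maximum. The function names `dataset_mean`, `lvq_encode`, and `lvq_decode` are hypothetical and are not the Scalable Vector Search API; Turbo LVQ and multi-means LVQ are not modeled here.

```python
from typing import List, Tuple


def dataset_mean(data: List[List[float]]) -> List[float]:
    """Per-dimension mean of the dataset, used to center every vector."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]


def lvq_encode(vec: List[float], mean: List[float],
               bits: int = 8) -> Tuple[List[int], float, float]:
    """Quantize one vector using bounds taken from that vector alone.

    Assumption: "locally adaptive" means the quantization range is the
    min/max of the mean-centered vector, stored next to its codes.
    """
    centered = [v - m for v, m in zip(vec, mean)]
    lo, hi = min(centered), max(centered)
    levels = (1 << bits) - 1
    # A constant vector has hi == lo; any positive step works in that case.
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((c - lo) / step))) for c in centered]
    return codes, lo, step


def lvq_decode(codes: List[int], lo: float, step: float,
               mean: List[float]) -> List[float]:
    """Reconstruct an approximation of the original vector."""
    return [lo + q * step + m for q, m in zip(codes, mean)]
```

With this scheme the per-component reconstruction error is bounded by half a quantization step, and the step shrinks for vectors whose centered values span a narrow range, which is the intuition behind adapting the bounds per vector rather than globally.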
Long-range Meta-path Search on Large-scale Heterogeneous Graphs
Li, Chao, Guo, Zijie, He, Qiuting, Xu, Hao, He, Kun
Utilizing long-range dependency, though extensively studied in homogeneous graphs, has not been well investigated on heterogeneous graphs. Addressing this research gap presents two major challenges. The first is to alleviate computational costs while endeavoring to leverage as much effective information as possible in the presence of heterogeneity. The second involves overcoming the well-known over-smoothing issue occurring in various graph neural networks. To this end, we investigate the importance of different meta-paths and introduce an automatic framework for utilizing long-range dependency on heterogeneous graphs, denoted as Long-range Meta-path Search through Progressive Sampling (LMSPS). Specifically, we develop a search space with all meta-paths related to the target node type. By employing a progressive sampling algorithm, LMSPS dynamically shrinks the search space with hop-independent time complexity. Utilizing a sampling evaluation strategy as the guidance, LMSPS conducts a specialized and effective meta-path selection. Subsequently, only effective meta-paths are employed for retraining to reduce costs and overcome the over-smoothing issue. Extensive experiments on various heterogeneous datasets demonstrate that LMSPS discovers effective long-range meta-paths and outperforms the state-of-the-art. Besides, it ranks top-1 on the leaderboards of ogbn-mag in Open Graph Benchmark. Our code is available at https://github.com/JHL-HUST/LDMLP.
Minimizing Regret in Billboard Advertisement under Zonal Influence Constraint
Ali, Dildar, Banerjee, Suman, Prasad, Yamuna
In a typical billboard advertisement technique, a number of digital billboards are owned by an influence provider, and many advertisers approach the influence provider for a specific number of views of their advertisement content on a payment basis. If the influence provider supplies at least the demanded influence, he receives the full payment; otherwise, he receives a partial payment. From the influence provider's perspective, supplying either more or less than an advertiser's demanded influence is a loss. This loss is formalized as 'Regret', and the influence provider's goal is naturally to allocate the billboard slots among the advertisers such that the total regret is minimized. In this paper, we study this problem as a discrete optimization problem and propose four solution approaches. The first one selects the billboard slots from the available ones in an incremental greedy manner, and we call this method the Budget Effective Greedy approach. The second one adds randomness to the first by performing the marginal gain computation for a sample of randomly chosen billboard slots. The remaining two approaches are further improvements over the second one. We analyze all the algorithms to understand their time and space complexity. We implement them with real-life trajectory and billboard datasets and conduct a number of experiments. We observe that the randomized budget effective greedy approach takes reasonable computational time while minimizing the regret.
ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution
Ye, Haoran, Wang, Jiarui, Cao, Zhiguang, Song, Guojie
The omnipresence of NP-hard combinatorial optimization problems (COPs) compels domain experts to engage in a trial-and-error heuristic design process. The long-standing endeavor of design automation has gained new momentum with the rise of large language models (LLMs). This paper introduces Language Hyper-Heuristics (LHHs), an emerging variant of Hyper-Heuristics that leverages LLMs for heuristic generation, featuring minimal manual intervention and open-ended heuristic spaces. To empower LHHs, we present Reflective Evolution (ReEvo), a generic searching framework that emulates the reflective design approach of human experts while far surpassing human capabilities with its scalable LLM inference, Internet-scale domain knowledge, and powerful evolutionary search. Evaluations across 12 COP settings show that 1) verbal reflections for evolution lead to smoother fitness landscapes, explicit inference of black-box COP settings, and better search results; 2) heuristics generated by ReEvo in minutes can outperform state-of-the-art human designs and neural solvers; 3) LHHs enable efficient algorithm design automation even when challenged with black-box COPs, demonstrating their potential for complex and novel real-world applications. Our code is available: https://github.com/ai4co/LLM-as-HH.
A Memetic Algorithm To Find a Hamiltonian Cycle in a Hamiltonian Graph
We present a memetic algorithm (MA) approach for finding a Hamiltonian cycle in a Hamiltonian graph. The MA is based on a proven approach to the Asymmetric Travelling Salesman Problem (ATSP) that, in this contribution, is boosted by the introduction of more powerful local searches. Our approach also introduces a novel technique that sparsifies the input graph under consideration for Hamiltonicity and dynamically augments it during the search. Such a combined heuristic approach helps to prove Hamiltonicity by finding a Hamiltonian cycle in less time. In addition, we also employ a recently introduced polynomial-time reduction from the Hamiltonian Cycle Problem to the Symmetric TSP, which is based on computing the transitive closure of the graph. Although our approach is a metaheuristic, i.e., it does not give a theoretical guarantee for finding a Hamiltonian cycle, we have observed that the method is successful in practice in verifying the Hamiltonicity of a larger number of instances from the Flinders University Hamiltonian Cycle Problem Challenge Set (FHCPSC), even for the graphs that have large treewidth. The experiments on the FHCPSC instances and a computational comparison with five recent state-of-the-art baseline approaches show that the proposed method outperforms those for the majority of the instances in the FHCPSC.
Approximate Nearest Neighbor Search with Window Filters
Engels, Joshua, Landrum, Benjamin, Yu, Shangdi, Dhulipala, Laxman, Shun, Julian
The nearest neighbor search problem has been widely studied for more than 30 years (Arya & Mount, 1993). Given a dataset D, the problem requires the construction of an index that can efficiently answer queries of the form "what is the closest vector to x in D?" Solving this problem exactly degrades to a brute force linear search in high dimensions (Rubinstein, 2018), so instead both theoreticians and practitioners focus on the relaxed c-approximate nearest neighbor search problem (ANNS), which asks "what is a
Although this problem has many motivating examples, there is a dearth of papers examining it in the literature. Some vector databases analyze window search-like problem instances as an additional feature of their system, but this analysis is typically secondary to their main approach and too slow for large-scale real-world systems; as far as we are aware, we are the first to propose, analyze, and experiment with a non-trivial solution to the window search problem.
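To make the query form concrete, here is a minimal brute-force sketch of nearest neighbor search restricted by a window filter. The excerpt does not define the window search problem, so this assumes that every vector carries a numeric label (for example a timestamp) and that a query supplies a closed label range; the function name `window_search` and its parameters are hypothetical. This is the exact linear-scan baseline that the excerpt describes as too slow at scale, not the authors' index.

```python
import math
from typing import Optional, Sequence


def window_search(points: Sequence[Sequence[float]],
                  labels: Sequence[float],
                  query: Sequence[float],
                  lo: float,
                  hi: float) -> Optional[int]:
    """Return the index of the closest point whose label lies in [lo, hi].

    Exact linear scan: every point is checked against the window and,
    if it passes, compared to the query by Euclidean distance.
    Returns None when no point falls inside the window.
    """
    best_idx, best_dist = None, math.inf
    for i, (point, label) in enumerate(zip(points, labels)):
        if lo <= label <= hi:
            d = math.dist(point, query)
            if d < best_dist:
                best_idx, best_dist = i, d
    return best_idx
```

The scan costs time linear in the dataset size for every query regardless of how narrow the window is, which is the cost a dedicated window-filter index would aim to avoid.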
FM3Q: Factorized Multi-Agent MiniMax Q-Learning for Two-Team Zero-Sum Markov Game
Hu, Guangzheng, Zhu, Yuanheng, Li, Haoran, Zhao, Dongbin
Many real-world applications involve agents that fall into two teams, with payoffs that are equal within the same team but of opposite sign across the opposing teams. These so-called two-team zero-sum Markov games (2t0sMGs) have been addressed with reinforcement learning in recent years. However, existing methods remain inefficient owing to insufficient consideration of intra-team credit assignment, data utilization, and computational intractability. In this paper, we propose the individual-global-minimax (IGMM) principle to ensure the coherence between two-team minimax behaviors and the individual greedy behaviors through Q functions in 2t0sMGs. Based on it, we present a novel multi-agent reinforcement learning framework, Factorized Multi-Agent MiniMax Q-Learning (FM3Q), which can factorize the joint minimax Q function into individual ones and iteratively solve for the IGMM-satisfied minimax Q functions for 2t0sMGs. Moreover, an online learning algorithm with neural networks is proposed to implement FM3Q and obtain the deterministic and decentralized minimax policies for two-team players. A theoretical analysis is provided to prove the convergence of FM3Q. Empirically, we use three environments to evaluate the learning efficiency and final performance of FM3Q and show its superiority on 2t0sMGs.
Genetic-based Constraint Programming for Resource Constrained Job Scheduling
Nguyen, Su, Thiruvady, Dhananjay, Sun, Yuan, Zhang, Mengjie
Resource constrained job scheduling is a hard combinatorial optimisation problem that originates in the mining industry. Off-the-shelf solvers cannot solve this problem satisfactorily in reasonable timeframes, while other solution methods such as many evolutionary computation methods and matheuristics cannot guarantee optimality and require low-level customisation and specialised heuristics to be effective. This paper addresses this gap by proposing a genetic programming algorithm to discover efficient search strategies of constraint programming for resource-constrained job scheduling. In the proposed algorithm, evolved programs represent variable selectors to be used in the search process of constraint programming, and their fitness is determined by the quality of solutions obtained for training instances. The novelties of this algorithm are (1) a new representation of variable selectors, (2) a new fitness evaluation scheme, and (3) a pre-selection mechanism. Tests with a large set of random and benchmark instances show that the evolved variable selectors can significantly improve the efficiency of constraint programming. Compared to highly customised metaheuristics and hybrid algorithms, evolved variable selectors can help constraint programming identify quality solutions faster, and proving optimality is possible if sufficiently large run-times are allowed. The evolved variable selectors are especially helpful when solving instances with large numbers of machines.
Comprehensive Exploration of Synthetic Data Generation: A Survey
Bauer, André, Trapp, Simon, Stenger, Michael, Leppich, Robert, Kounev, Samuel, Leznik, Mark, Chard, Kyle, Foster, Ian
Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data due to expensive acquisition and privacy legislation. Synthetic data emerges as a solution, but the abundance of released models and limited overview literature pose challenges for decision-making. This work surveys 417 Synthetic Data Generation (SDG) models over the last decade, providing a comprehensive overview of model types, functionality, and improvements. Common attributes are identified, leading to a classification and trend analysis. The findings reveal increased model performance and complexity, with neural network-based approaches prevailing, except for privacy-preserving data generation. Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete. Implications from our performance evaluation highlight the scarcity of common metrics and datasets, making comparisons challenging. Additionally, the neglect of training and computational costs in literature necessitates attention in future research. This work serves as a guide for SDG model selection and identifies crucial areas for future exploration.
Evolutionary Tabletop Game Design: A Case Study in the Risk Game
Rossato, Lana Bertoldo, Bombardelli, Leonardo Boaventura, Tavares, Anderson Rocha
Creating and evaluating games manually is an arduous and laborious task. Procedural content generation can aid by creating game artifacts, but usually not an entire game. Evolutionary game design, which combines evolutionary algorithms with automated playtesting, has been used to create novel board games with simple equipment; however, the original approach does not include complex tabletop games with dice, cards, and maps. This work proposes an extension of the approach for tabletop games, evaluating the process by generating variants of Risk, a military strategy game where players must conquer map territories to win. We achieved this using a genetic algorithm to evolve the chosen parameters, as well as a rules-based agent to test the games and a variety of quality criteria to evaluate the new variations generated. Our results show the creation of new variations of the original game with smaller maps, resulting in shorter matches. Also, the variants produce more balanced matches, maintaining the usual drama. We also identified limitations in the process: in many cases, the objective function was correctly pursued, but the generated games were nearly trivial. This work paves the way towards promising research regarding the use of evolutionary game design beyond classic board games.