Goto

Collaborating Authors

 homomorphism




Ineffectiveness for Search and Undecidability of PCSP Meta-Problems

Larrauri, Alberto

arXiv.org Artificial Intelligence

It is an open question whether the search and decision versions of promise CSPs are equivalent. Most known algorithms for PCSPs solve only their \emph{decision} variant, and it is unknown whether they can be adapted to solve \emph{search} as well. The main approaches, called BLP, AIP and BLP+AIP, handle a PCSP by finding a solution to a relaxation of some integer program. We prove that rounding those solutions to a proper search certificate can be as hard as any problem in the class TFNP. In other words, these algorithms are ineffective for search. Building on the algebraic approach to PCSPs, we find sufficient conditions that imply ineffectiveness for search. Our tools are tailored to algorithms that are characterized by minions in a suitable way, and can also be used to prove undecidability results for meta-problems. This way, we show that the families of templates solvable via BLP, AIP, and BLP+AIP are undecidable. Using the same techniques we also analyze several algebraic conditions that are known to guarantee the tractability of finite-template CSPs. We prove that several meta-problems related to cyclic polymorphims and WNUs are undecidable for PCSPs. In particular, there is no algorithm deciding whether a finite PCSP template (1) admits cyclic a polymorphism, (2) admits a WNU.




UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

Firestone, Preston, Ugare, Shubham, Singh, Gagandeep, Misailovic, Sasa

arXiv.org Artificial Intelligence

Subword tokenization segments input text according to a pre-defined vocabulary to feed it into a language model; the language model, in turn, generates a sequence made from this same vocabulary. The members of the vocabulary can be built of code points or bytes. Using code points means that all members of the vocabulary are valid UTF-8 characters. However, it also requires thousands of initial members to achieve acceptable coverage of inputs. Beginning with bytes, on the contrary, avoids out-of-vocabulary errors with only 256 initial members of the vocabulary, but the members of the vocabulary and sequences of them are not guaranteed to be valid UTF-8. Sequences that are not valid UTF-8 break code that assumes its input to be valid UTF-8. Applications of language models must account for the breakage thereby introduced. In this paper, we formalize tokenization using monoid theory and prove that tokenizers whose vocabularies contain tokens that are ill-formed UTF-8 can always produce sequences that are ill-formed UTF-8. We demonstrate formally that attempting to incrementally convert tokens back to a string and interpret the results as UTF-8 gives different results than converting the whole sequence of tokens at once. This formal result predicts real-world bugs: we evaluate mitigations for the problem identified and provide case studies of major foundation models, serving engines, and constrained generation systems.


Message Passing on the Edge: Towards Scalable and Expressive GNNs

Barceló, Pablo, Jogl, Fabian, Kozachinskiy, Alexander, Lanzinger, Matthias, Neumann, Stefan, Rojas, Cristóbal

arXiv.org Artificial Intelligence

Graph learning, and in particular message-passing graph neural networks (GNNs), have emerged as a fundamental tool across the sciences (Scarselli et al., 2009; Kipf & Welling, 2017; Wu et al., 2019). A major factor in the wide applicability of GNNs is the robust theoretical framework that grounds their study in principled comparisons between architectures. Core to this framework is the characterization of the expressive power of GNNs in terms of the Weisfeiler-Leman (1WL) test, a central notion in the theoretical study of graph similarity (Morris et al., 2019; Xu et al., 2019). Owing to this connection, alternative, yet equally insightful, characterizations of GNN expressive power have been developed--such as via finite-variable fragments of counting logics (Cai et al., 1992) or homomorphism counts from classes of graphs of bounded treewidth (Dvor ak, 2010; Dell et al., 2018). While these perspectives enrich our understanding of GNN capabilities, they also make clear that 1WL-based GNNs remain limited in important ways. To highlight the necessity for more expressive tests than 1WL, an illustrative limitation of 1WL-based GNNs is their inability to detect small motifs or count triangles (Chen et al., 2020; Arvind et al., 2020; Lanzinger & Barcel o, 2024). The higher-order WL hierarchy offers a systematic remedy: properties that elude 1WL are often captured at some higher level, i.e., by kWL for some k > 1. Yet this increase in expressive power comes at a steep computational cost.